DNA Synthesis and Genome Engineering


Competition and the Future of Reading 
and Writing DNA 


Robert Carlson 


Biodesic and Bioeconomy Capital, 3417 Evanston Ave N, Ste 329, Seattle, WA 98103, USA 


Constructing arbitrary genetic instruction sets is a core technology for biological 
engineering. Biologists and engineers are pursuing even better methods to 
assemble these arbitrary sequences from synthetic oligonucleotides (oligos) [1]. 
These new assembly methods in principle reduce costs, improve access, and 
result in long sequences of error-free DNA that can be used to construct entire 
microbial genomes [2]. However, an increasing diversity of assembly methods is 
not matched by any obvious corresponding innovation in producing oligos. 
Commercial oligo production employs a very narrow technology base that is 
many decades old. Consequently, there is only minimal price and product dif- 
ferentiation among corporations that produce oligos. Prices have stagnated, 
which in turn limits the economic potential of new assembly methods that rely 
on oligos. Improvements may come via recently demonstrated assembly meth- 
ods that are capable of using oligos of lower quality and lower cost as feedstocks. 
However, while these new methods may substantially lower the cost of gene- 
length double-stranded DNA (dsDNA), they also may be economically viable 
only when producing many orders of magnitude more dsDNA than is
now used by the market. The commercial success of these methods, and the
broader access to dsDNA they enable, may therefore depend on structural 
changes in the market that are yet to emerge. 


1.1 Productivity Improvements in Biological 
Technologies 


In considering the larger impact of technological monoculture in DNA synthesis, 
it is useful to contrast DNA synthesis and assembly with DNA sequencing. In par- 
ticular, it is instructive to compare productivity estimates of commercially avail- 
able sequencing and synthesis instruments (Figure 1.1). Reading DNA is as crucial 
as writing DNA to the future of biological engineering. Due to not just commercial
competition but also competition between sequencing technologies, both
prices and instrument capabilities are improving rapidly. The technological
diversity responsible for these improvements poses challenges in making quantitative
comparisons. As in previous discussions of these trends, in what follows I rely on
the metrics of price [$/base] and productivity [bases/person/day].

Figure 1.1 Estimates of the maximum productivity of DNA synthesis and sequencing enabled
by commercially available instruments. Productivity of DNA synthesis is shown only for
column-based synthesis instruments, as data for sDNA fabricated on commercially available
DNA arrays is unavailable; exceptions are discussed in the text. Shown for comparison is
Moore's law, the number of transistors per chip. (Intel; Carlson, 2010 [3]; Loman et al. 2012 [4];
Quail et al. 2012 [5]; Liu, 2012 [6].)

Figure 1.1 also directly compares the productivity enabled by commercially 
available sequencing and synthesis instruments to Moore’s law, which describes 
the exponential increase in transistor counts in CPUs over time. Readers new to 
this discussion are referred to References 3 and 7 for in-depth descriptions of the
development of these metrics and the utility of a comparison with Moore’s law 
[3, 7]. Very briefly, Moore’s law is a proxy for productivity; more transistors ena- 
ble greater computational capability, which putatively equates to greater 
productivity. 

Visual inspection of Figure 1.1 reveals several interesting features. First, gen- 
eral synthesis productivity has not improved for several years because no new 
instruments have been released publicly since about 2008. Productivity estimates 
for instruments developed and run by oligo and gene synthesis service providers 
are not publicly available (see footnote 1).


1 It is likely that array-based DNA synthesis used to supply gene assembly operates at a much
higher productivity than column-based synthesis. For example, Agilent reportedly produces and 
ships in excess of 30 billion bases of ssDNA a day, the equivalent of more than 10 human genomes, 
on an undisclosed number of arrays (Darlene Solomon, Personal Communication). 




Second, it is clear that DNA sequencing platforms are improving very rapidly, 
now much faster than Moore’s law. 

Moore’s law and its economic and social consequences are often used to 
benchmark our expectations of other technologies. Therefore, developing an 
understanding of this “law” provides a means to compare and contrast it with 
other technological trends. 


1.2 The Origin of Moore’s Law and Its Implications 
for Biological Technologies 


Moore's law is often mistakenly described as a technological inevitability or is 
assumed to be some sort of physical phenomenon. It is neither; Moore’s law is a 
business plan, and as such it is based on economics and planning. Gordon Moore's 
somewhat opaque original statement of what became the “law” was a prediction 
concerning economically viable transistor yields [8]. Over time, Moore’s eco- 
nomic observation became an operational model based on monopoly pricing, and 
it eventually enabled Intel to outcompete all other manufacturers of general 
CPUs. Two important features distinguish CPUs from other technologies and 
provide insight into the future of trends in biological technologies: the first is the 
cost of production, and the second is the monopoly pricing structure. 

Early on Intel recognized the utility of exploiting Moore’s law as a business 
plan. A simple scaling argument reveals the details of the plan. While transistor 
counts increased exponentially, Intel correspondingly reduced the price per 
transistor at a similar rate. In order to maintain revenues, the company needed to 
ship proportionally more transistors every quarter; in fact, the company increased
its shipping numbers faster than prices fell, enabling consistent revenue growth
for several decades. This explains why Intel’s former CEO Andy Grove reportedly
pushed constantly for ever greater scale [9].

In this sense, Moore’s law was always about economics and planning in a 
multibillion-dollar industry. In the year 2000, a new chip fab cost about $1 bil- 
lion; in 2009, it cost about $3 billion. Now, according to The Economist, Intel 
estimates that a new chip fab costs about $10 billion [9]. This apparent exponen- 
tial increase in the cost of semiconductor processing is known as Rock’s law. It is 
often argued that Moore’s law will eventually expire due to the physical con- 
straints of fabricating transistors at small length scales, but it is more likely to 
become difficult to economically justify constructing fabrication facilities at the 
cost of tens to hundreds of billions of dollars. Even through the next several itera- 
tions, these construction costs will dictate careful planning that spans many 
years. No business spends $10 billion without a great deal of planning, and, more 
directly, no business finances a manufacturing plant that expensive without 
demonstrating a long-term plan to repay the financiers. Moreover, Intel must 
coordinate the manufacturing and delivery of very expensive, very complex sem- 
iconductor processing instruments made by other companies. Thus Intel’s plan- 
ning and finance cycles explicitly extend many years into the future. New 
technology has certainly been required to achieve each planning goal, but this is 
part of the ongoing research, development, and planning process for Intel. 

Moore’s law served a second purpose for Intel, one that is less well recognized
but arguably more important: it was a pace selected to enable Intel to win.
Intel successfully organized an entire industry to move at a pace only it could 
survive. And only Intel did survive. While Intel still has competitors in products 
such as memory or GPUs, companies that produced high volume, general 
CPUs have all succumbed to the pace of Moore’s law. The final component of this 
argument is that, according to Gordon Moore, Intel could have increased
transistor counts faster than the historical rate (see footnote 2). In fact, Intel ran on a faster
internal innovation clock than it admitted publicly, which means that Moore’s law 
was, as one Intel executive put it, a “marketing head fake” [10]. The inescapable 
conclusion of this argument is that the management of Intel made a very careful 
calculation; they evaluated product rollouts to consumers — the rate of new prod- 
uct adoption, the rate of semiconductor processing improvements, and the finan- 
cial requirements for building the next chip fab line — and then set a pace that 
nobody else could match but that left Intel plenty of headroom for future prod- 
ucts. In effect, if not intent, Intel executed a strategy that enabled it to set CPU 
prices and then to reduce those prices at a rate no other company could match. 

This long-term planning, the pricing structure, and the resulting lack of competition
contrast quite strongly with the commercial landscape for biological technologies.
Whereas the exponential pace of doubling of transistor counts was
controlled by just one company, productivity in DNA sequencing has recently 
improved faster than Moore’s law due to competition not just among companies 
but also among technologies. Conversely, the lack of improvement in synthesis 
productivity suggests that the narrow technology base for writing DNA has 
reached technical and, therefore, economic limits. Nonetheless, while Figure 1.1 
may suggest a temporary slowdown in the rate of improvement for sequencing, 
and in effect shows zero recent improvement for synthesis, new technologies will 
inevitably facilitate continued competition and, therefore, continued productiv- 
ity improvement. 


1.3 Lessons from Other Technologies


Compared with that in other industries, the financial barrier to entry in biological 
technologies is quite low. Unlike chip manufacturing, there is nothing in biology 
with a commercial development price tag of $10 billion. The Boeing 787 reportedly
cost $32 billion to develop as of 2011, on top of a century of multibillion-dollar
aviation projects that preceded it [11]. Better Place, an electric car
company, declared bankruptcy after receiving $850 million in investment [12].
Tesla Motors has reported only one profitable quarter since 2003 and continues to
operate in the red while working to achieve manufacturing scale-up [13, 14].
There are two kinds of costs that are important to distinguish here. The first is
the cost of developing and commercializing a particular product. Based on the
money reportedly raised and spent by Illumina, Pacific Biosciences, Oxford
Nanopore, Life, Ion Torrent, and Complete Genomics (the latter three before
acquisition), it appears that developing and marketing a second-generation
sequencing technology can cost more than $100 million. Substantially more
money gets spent, and lost, in operations before any of these product lines is
revenue positive. Nonetheless, relatively low development costs have enabled a
number of companies to enter the market for DNA sequencing, resulting in a
healthy competition in a market that is presently modest in size but that is
expected to grow rapidly over the coming decades.


2 Gordon Moore to Danny Hillis, as related by Danny Hillis, Personal Communication.


1.4 Pricing Improvements in Biological Technologies 


The second kind of cost to keep in mind is the cost of using new technologies to
produce an object or data. Figure 1.2 is a plot of commercial prices for
column-synthesized oligos, gene-length synthetic DNA (sDNA), and DNA sequencing.
Prior to 2006, the sequencing market was dominated by Sanger-based capillary 
instruments produced by Applied Biosystems, in effect another pricing monop- 
oly. After 2006, the market saw a rapid proliferation of not just commercial but 
also technological competition with the launch of next-generation systems from 
454, Illumina, Ion Torrent, Pacific Biosciences, and Oxford Nanopore based on a 
diversity of chemical and physical detection methodologies [15]. Illumina 
presently dominates the market for sequencing instruments but is facing compe- 
tition from Oxford Nanopore and various Chinese insurgents. There also remains 
technological diversity between companies, which contributes to competitive
pressures. An important consequence of the emergence of technological
competition in the DNA sequencing market is a rapid price decrease. The NIH
maintains a version of this plot that compares sequencing prices with cost per
megabyte for memory, another form of Moore’s law [16]. Both Figure 1.2 and the
NIH plot show that sequencing costs kept pace with Moore’s law while a pricing
monopoly was in effect. The emergence of technological competition produced
both productivity improvements and price changes that outpaced Moore’s law.

Figure 1.2 Commercial prices per base for DNA sequencing, column-synthesized
oligonucleotides, and gene-length sDNA. Reported prices for array-synthesized oligos vary
widely, and no time series is available. Market pricing for genes can vary by up to an order of
magnitude, depending on sequence composition and complexity. (Carlson (2010),
Commercial price quotes.)

In contrast, despite modest commercial competition in the DNA synthesis 
market, the lack of technological competition has limited price decreases in the 
last 5 years. The industry as it exists today is based on chemistry that is several 
decades old, in which oligos are synthesized step by step on an immobilized 
substrate. Using array-synthesized oligos for gene assembly appears to be lower- 
ing the market price, though quality and delivery time are reportedly inconsist- 
ent across the industry. Improved error correction and removal technologies 
may further reduce the assembly cost for genes and thereby improve the profit 
margins [17]. My informal conversations with industry insiders suggest that 
oligo producers may no longer include the cost of goods in calculating prices; 
that is, oligo prices are evidently determined largely by the cost of capital rather 
than the cost of raw materials. This suggests that very little pricing improvement 
can be expected for genes produced from standard oligo synthesis. 


1.5 Prospects for New Assembly Technologies 


Array synthesis has the advantage of a low volume production of oligos with 
high library diversity [18]. Gene assembly based on array synthesis has proved 
difficult to commercialize. At least three companies in this space, Codon 
Devices, Gen9, and Cambrian Genomics, have gone bankrupt or been acquired 
in recent years. Twist, a more recent entrant, now quotes prices in the neighborhood
of $0.10 per base and publicly asserts it will push prices much lower in the
coming years. 

With prices potentially soon falling by orders of magnitude, one must ask 
about the subsequent impact on the market for synthetic genes. New firms enter- 
ing the market are implicitly working on the hypothesis that supply-side price 
reductions will drive increased demand. The most obvious source of that demand 
would be forward design of genetic circuits based on rational models. Yet the 
most sophisticated synthetic genetic circuits being constructed in industrial set- 
tings are designed largely using heuristic models rather than quantitative design 
tools [19]. Moreover, these circuits contain only a handful of components, which 
stands as a substantial bottleneck for demand. Alternatively, customers may
employ less up-front predictive design and instead rely on high-throughput 
screening of pathway variants; screening libraries of pathways has the potential 
to create substantial demand for synthetic genes [20]. 

Considering the interplay between market size and price reveals challenges for 
companies entering the gene synthesis industry. Recalling the lessons of Moore’s 
law, a relatively simple scaling argument will reveal the performance constraints
of the gene synthesis industry. Intel knew that it could grow financially in the 
context of exponentially falling transistor costs by shipping exponentially more 
transistors every quarter — that is, the business model of Moore’s law. But that 
was in the context of an effective pricing monopoly, and Intel’s success required 
a market that grew exponentially for decades. The question for synthetic gene 
companies is whether the market will grow fast enough to provide adequate revenues
when prices fall. For every order of magnitude drop in the price of synthetic
genes, the industry will have to ship an order of magnitude more DNA
just to maintain constant revenues. More broadly, in order for the industry to
grow, synthesis companies must find a way to expand their market at a rate faster
than prices fall. Unfortunately, as best as I can tell, despite falling prices and
putative increases in demand, the global gene synthesis industry generated only 
about $150 million in 2015 [21]. The total size of the industry appears to have 
been static, or even to have decreased, over the prior decade. 
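
The scaling argument above can be made concrete with a short sketch. It assumes only that revenue equals price per base times bases shipped. The $150 million figure is the 2015 estimate quoted above; the starting price of $0.20 per base and the function names are hypothetical, chosen purely for illustration.

```python
def bases_for_revenue(revenue_usd: float, price_per_base_usd: float) -> float:
    """Number of bases that must be sold to earn revenue_usd at the given price."""
    if price_per_base_usd <= 0:
        raise ValueError("price per base must be positive")
    return revenue_usd / price_per_base_usd


def required_volume_growth(price_drop: float, revenue_growth: float = 1.0) -> float:
    """Factor by which shipped volume must rise if the price falls by price_drop
    and revenue is to change by revenue_growth (1.0 means constant revenue)."""
    if price_drop <= 0 or revenue_growth <= 0:
        raise ValueError("factors must be positive")
    return price_drop * revenue_growth


if __name__ == "__main__":
    revenue_2015 = 150e6   # USD, the 2015 industry estimate quoted in the text
    assumed_price = 0.20   # USD per base; hypothetical, for illustration only
    for drop in (1, 10, 100, 1000):
        bases = bases_for_revenue(revenue_2015, assumed_price / drop)
        print(f"price falls {drop:>4}x -> {bases:.1e} bases to hold revenue constant")
```

Under these assumptions, each tenfold price drop requires a tenfold increase in bases shipped merely to stand still, and any revenue growth multiplies that requirement further.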

Ultimately, for a new wave of gene synthesis companies to be successful, they 
have to provide their customers with something of value. Academic customers 
are likely to become more plentiful as it becomes even more obvious that order- 
ing genes is cheaper than cloning genes, even with graduate student labor costs. 
Gene synthesis pioneer John Mulligan used to cite NIH expenditures on 
cloning — approximately $3 billion annually — as a potential market size for gene 
synthesis [22]. This is certainly an attractive potential market. However, with the 
price per base potentially falling dramatically in the near term, the comparison to 
cloning must focus on the total number of cloned bases replaced by synthesis 
and at what exact price. 

For commercial customers, it is less obvious that lower prices will equate to sub- 
stantial increases in demand. The cost of sDNA is always going to be a small cost 
of developing a product, and it is not obvious that making a small cost even smaller 
will affect the operations of an average corporate lab. In general, research only 
accounts for 1-10% of the cost of the final product [23]. The vast majority of devel- 
opment costs are in scaling up production and in polishing the product into some- 
thing customers will actually buy. For the sake of argument, assume that the total 
metabolic engineering development costs for a new product are in the neighborhood
of $50-100 million, a reasonable estimate given the amounts that companies
such as Gevo and Amyris have reportedly spent. In that context, reducing the cost
of sDNA from $50,000 to $500 may be useful, but the corporate scientist-customer
will be more concerned about reducing the $50 million overall costs by a factor of 
two, or even an order of magnitude, a decrease that would drive the cost of sDNA 
into the noise. Thus, in order to increase demand adequately, the production of 
radically cheaper sDNA must be coupled with innovations that reduce the overall
product development costs. As suggested above, forward design of complex
circuits is unlikely to provide adequate innovation anytime soon. An alternative 
may be high-throughput screening operations that enable testing many variant 
pathways simultaneously. But note that this is not just another hypothesis about 
how the immediate future of engineering biology will change but also another gen- 
erally unacknowledged hypothesis. It might turn out to be wrong, and elucidating 
one final difference between transistors and DNA may explain why. 

The global market for transistors has grown consistently for decades, driven by
an insatiable demand for more computational power and digital storage. Every 
new product must contain more transistors than the model it replaces. In con- 
trast, while the demand for biological products is also growing, every new bio- 
logical product is made using, in principle, just one DNA sequence. In practice, 
while many different DNA sequences may be constructed and tested in develop- 
ing a new product, these many sequences are still winnowed down to only one 
sequence that defines a microbial, plant, or mammalian production strain. 
Nevertheless, this fundamental difference in use between transistors and DNA 
reveals the gene synthesis industry as the provider of engineering prototypes 
rather than as a large volume manufacturer of consumer goods. Consequently, 
while high-throughput synthetic biology companies such as Amyris, Ginkgo 
Bioworks, and Zymergen may place relatively large orders for sDNA, the price 
and volume of that sDNA will never have much impact on the final products 
produced by those companies. 
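
The cost comparison made earlier in this section can be checked with a few lines of arithmetic. The sketch below uses the figures from the text ($50 million in total development costs, sDNA falling from $50,000 to $500); the function names are hypothetical, and the factor-of-two reduction is simply the example mentioned above.

```python
def sdna_share(sdna_cost_usd: float, total_cost_usd: float) -> float:
    """Fraction of total development cost attributable to sDNA."""
    if total_cost_usd <= 0:
        raise ValueError("total cost must be positive")
    return sdna_cost_usd / total_cost_usd


def savings(total_cost_usd: float, sdna_before_usd: float,
            sdna_after_usd: float, overall_reduction_factor: float):
    """Return (saving from cheaper sDNA, saving from cutting overall cost)."""
    from_sdna = sdna_before_usd - sdna_after_usd
    from_overall = total_cost_usd - total_cost_usd / overall_reduction_factor
    return from_sdna, from_overall


if __name__ == "__main__":
    total = 50e6  # USD, lower end of the development cost range assumed in the text
    for sdna in (50_000, 500):
        print(f"sDNA at ${sdna:,}: {sdna_share(sdna, total):.4%} of development cost")
    from_sdna, from_overall = savings(total, 50_000, 500, 2)
    print(f"saving from cheaper sDNA: ${from_sdna:,.0f}")
    print(f"saving from halving overall cost: ${from_overall:,.0f}")
```

On these numbers the saving from cheaper sDNA is tens of thousands of dollars, while halving overall costs saves tens of millions, which is why the sDNA line item ends up in the noise.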


1.6 Beyond Programming Genetic Instruction Sets 


At present, the cost of purifying oligos and short dsDNA can exceed the cost of 
the DNA itself by as much as a factor of three. The availability of lower cost, high
quality dsDNA may therefore enable applications that are presently not econom- 
ically viable at large scale. Beyond its utility in programming biological systems, 
dsDNA can be used as nanoscale structural or functional components [24]. The 
future of these applications is difficult to predict but could include circuitry 
assembled from DNA that is modified using proteins and chemistry to create 
conductive and semiconductive regions useful for computation [25]. It is unclear 
what sDNA market size these applications may support. Recent progress suggests
that new demand might emerge from the use of DNA as a digital information
storage medium [26]. Even a single, modestly sized data center would consume
many orders of magnitude more sDNA than any prospective use of sDNA in
biological contexts [27]. 


1.7 Future Prospects 


Regardless of the particular course of companies entering the gene synthesis mar- 
ket, it appears that prices are likely to fall, potentially fueling an increase in demand. 
That demand may come in part from customers who fall outside the usual aca- 
demic and corporate classifications; start-up companies, community labs, and 
individual, independent entrepreneurs and scientists are likely to use sDNA in new 
and interesting ways. The standing biosecurity strategy of the United States is to 
explicitly engage and encourage this innovation, including in contexts such as 
“garages and basements” [28]. This strategy recognizes the important role of entre- 
preneurs in innovation and job creation and also recognizes the difficulty of pre- 
venting access to biological technologies through regulations or restrictions. 




Complementing the engagement strategy is an effort to prevent accidentally 
synthesizing and shipping potentially hazardous sequences. Most gene synthesis 
companies have voluntarily signed onto international agreements to screen orders 
against lists of pathogens and toxins such as the Harmonized Screening Protocol of 
the International Gene Synthesis Consortium (IGSC) [29]. 

The technical potential of new sDNA production methods may provide an 
opportunity to build and test far more genetic circuit designs than what is now 
feasible. The economic demand for biological production is enormous and is 
growing rapidly [30, 31]. Whether newly emerging sDNA companies survive 
economically depends in large part on their ability to increase total market 
demand sufficiently to offset falling prices. The size of that market, in turn, 
largely depends on whether less expensive dsDNA enables customers to reduce 
research and development costs and to create more products. The fundamental 
problem for the synthesis industry is that, however valuable sDNA is substan- 
tively to biological engineering in practice, the monetary value of that DNA is 
small compared with total development costs and has been falling, at times very 
rapidly, for decades. Falling prices limit both the maximum profit margin and the 
incentive to invest in new technology. Any new technology that does enter the 
market will inevitably drive competition, further depressing prices and margins. 
Going forward, productivity and prices are likely to display step changes result- 
ing from the emergence of new technology and competition rather than display 
smooth long-term changes. Finally, given the relatively low barriers to entry for 
biological technologies and the consequent inevitable competition, it is worth 
asking whether centralized production is the future of the industry. As with 
printing documents, it may be that the economics of printing and using DNA 
favor distributed production, perhaps even a desktop model. There is no fundamental
barrier to integrating any demonstrated synthesis and assembly technologies
into a desktop gene printer. Over the long term, a globally
expanding customer base will ultimately determine how sDNA is produced and
used. Regardless of how current technology specifically impacts supply and 
prices, that customer base is increasing, and it is likely that the trends displayed 
in Figures 1.1 and 1.2 will continue for many years to come. 


Trackable multiplex recombineering (TRMR) allows researchers to explore the
otherwise large mutational space of the Escherichia coli genome efficiently. This
method is used to simultaneously change the expression level of every gene in
the genome, so that each gene is either overexpressed or switched off. A variation
on TRMR, tunable trackable multiplex recombineering (T²RMR), allows expression
levels to be tuned over a 10⁴-fold range. TRMR and T²RMR therefore allow
bacterial responses to be tuned to different environmental cues. Additionally, the
genomic changes can be tracked and identified for population dynamic studies
and for further analyses thanks to “barcoding” (or “tagging”) of every mutation.
The TRMR and T²RMR procedures include library design, production, and
amplification, followed by the insertion of the DNA library into a precise location
in the genome via phage-enabled homologous recombination. Then, the
heterogeneous bacterial population is subjected to a defined stress or screened
for a specific trait. Finally, beneficial mutations are identified by means of barcode
hybridization to a microarray or by sequencing. Importantly, TRMR- and
T²RMR-based populations can be established by a single scientist in a single day,
and depending on the desired trait, genome-wide mapping results may be
obtained within as little as a week.


2.1 Introduction 


While traditional engineering usually involves the design and production of 
mechanical structures and devices, biological engineering is focused on modify- 
ing the natural world and adapting it to human needs. Although many consider 
biological engineering to be a new field, it is as old as civilization itself. Selective 
breeding of plants and animals for specific traits exemplifies one of the hallmarks 
of biological engineering and evolution in general: selection of a successful sub- 
population that will establish the next generations, thus continuously refining 
the desired phenotype. The field of genetics and the discovery of the mechanisms
by which traits are propagated along generations allowed people, for the
first time, to rationally induce genetic modifications rather than wait for them to 
occur randomly. Early efforts focused on rational transfer and modifications of 
single genes and were collectively termed “genetic engineering.” Complex traits,
however, derived from multiple gene interactions or whole metabolic pathways, 
cannot be efficiently engineered one gene at a time and require high-throughput 
and systemic approaches, the recognition of which formed the basis of the field 
of metabolic engineering [1, 2]. 

Since then, advances in DNA sequencing and systems biology methodologies 
have led to exceptional new approaches for characterizing complex traits and 
their underlying genetic networks. Additionally, rapid progress in DNA chemical 
synthesis and the development of recombination-based methods now allow 
mutations to be incorporated in multiplex and at a throughput orders of magnitude
beyond the state of the art a decade ago [3-6]. In contrast to earlier methods
of individually synthesizing oligonucleotides or using DNA segments from
natural sources, current technology allows the parallel production of synthetic 
DNA (synDNA) libraries [5]. Additionally, homologous recombination-based 
techniques (recombineering) that promote the integration of foreign DNA into 
the chromosome of the target organism have reached relatively high levels of 
efficiency [6]. Recombineering in E. coli is based on targeting a synthetic recom- 
bineering substrate (a single-stranded (ss) DNA oligonucleotide or a double- 
stranded (ds) DNA cassette) to a specific locus on the chromosome via homology 
arms. Typically, this DNA substrate contains a desired mutation and may also 
code for an antibiotic resistance gene as a selective marker. The actual recombination
is enabled by either the RecE/T or the λ-Red prophage system [6, 7].

Here, we describe the TRMR and T²RMR techniques, which not only make the
multiplexing of recombineering possible in E. coli but also provide the ability to 
track the engineered genetic changes accurately. Currently, both library designs 
allow one to target, in parallel, every gene in the genome for either overexpression
or downregulation, with T²RMR allowing for tuning of gene expression over
a ~10⁴-fold range. The trackability is achieved by adding a unique “molecular
barcode” [8] upstream of every mutation, facilitating its identification. These 
methods enable the search for specific and desired genetic traits and aid in the 
navigation of an otherwise large mutational space (i.e., in this case, the total 
number of possible single mutations). We discuss the benefits of such methods, 
existing challenges, possible combinations with other methods, and some 
possible future development and applications. 
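
To illustrate how barcodes make mutations trackable by sequencing, the following is a minimal sketch that counts barcode occurrences in reads taken before and after a selection and reports a log2 enrichment for each barcode. It is not the published TRMR analysis pipeline: the function names, the exact-substring matching, and the pseudocount are all assumptions made for illustration, and a real analysis would need to tolerate sequencing errors.

```python
from collections import Counter
from math import log2


def count_barcodes(reads, barcodes):
    """Count reads containing each known barcode (exact substring match;
    a simplifying assumption that ignores sequencing errors)."""
    counts = Counter()
    for read in reads:
        for barcode in barcodes:
            if barcode in read:
                counts[barcode] += 1
                break  # assign each read to at most one barcode
    return counts


def log2_enrichment(before, after, barcodes, pseudocount=1.0):
    """log2 ratio of barcode frequency after selection to frequency before.
    The pseudocount (an assumption) avoids division by zero for rare barcodes."""
    n = len(barcodes)
    total_before = sum(before[b] for b in barcodes) + pseudocount * n
    total_after = sum(after[b] for b in barcodes) + pseudocount * n
    return {
        b: log2(((after[b] + pseudocount) / total_after)
                / ((before[b] + pseudocount) / total_before))
        for b in barcodes
    }


if __name__ == "__main__":
    barcodes = ["ACGTACGTAC", "TTGCATTGCA"]  # made-up barcode sequences
    before = count_barcodes(["GGACGTACGTACGG", "CCTTGCATTGCACC"], barcodes)
    after = count_barcodes(["GGACGTACGTACGG"] * 3 + ["CCTTGCATTGCACC"], barcodes)
    for barcode, score in log2_enrichment(before, after, barcodes).items():
        print(barcode, round(score, 2))
```

A positive score marks a barcode, and hence a mutation, that became more frequent under the applied stress; a negative score marks one that was depleted.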


2.2 Current Recombineering Techniques 


E. coli did not evolve an efficient mechanism for recombination; therefore spon- 
taneous homologous recombination of foreign genetic material is typically a rare 
event, on the order of 10° for linear ssDNA or dsDNA substrates [9]. It has been 
suggested that the low efficiency is primarily due to endogenous nucleases that 
rapidly degrade the unprotected DNA [10, 11]. Phage-based methods, which rely 
on the overexpression of phage proteins that prepare and protect the DNA substrate,
have been developed to more efficiently induce the incorporation of
desired DNA segments into cells. Two popular methods are the RecE/T and the
λ-Red prophage systems.


2.2.1 Recombineering Systems 


The RecE/T system encodes two proteins: RecE, a 5′→3′ exonuclease,
and RecT, an ssDNA binding protein [7]. The λ-Red system encodes
three proteins: Exo (homologous to RecE), Beta (homologous to RecT), and 
Gam, an inhibitor of the endogenous RecBCD exonuclease, which acts to protect 
the foreign DNA from active degradation [12]. The foreign DNA to be recom- 
bineered into the host genome may be in one of two forms depending on its 
source. Synthetic oligonucleotides (oligos) will usually be ssDNA, while poly- 
merase chain reaction (PCR)-amplified segments are double stranded. In order 
to be incorporated into the host genome, the recombineering substrate includes 
the desired DNA sequence to be incorporated and homology regions that flank 
both sides of this DNA sequence. The homology regions direct the DNA sub- 
strate to a specific location in the genome, where the endogenous replication 
machinery uses it as a template for replication. Here, we will focus on the λ-Red
system, in which the Exo, Beta, and Gam proteins work in concert to induce 
homologous recombination of DNA fragments into the host genome. 

Currently, the most popular vectors for the λ-Red system are the pKD46
and pSIM5 plasmids [11, 13, 14]. Protein expression from these vectors is
induced by incubation with arabinose or at 42°C, respectively. Both are additionally
temperature sensitive at 37°C, which allows for plasmid curing following
expression of the λ-Red proteins. The standard λ-Red recombineering workflow
includes transforming the host strain with the recombineering plasmid of choice, 
induction of the recombineering machinery, and additional transformation with 
the desired recombineering substrate, followed by selection/screening for suc- 
cessful recombinant strains [11]. 


2.2.2 Current Model of Recombination 


Several models of the exact recombination mechanism exist; however, the “rep- 
lication fork annealing model” is currently the most supported experimentally 
(Figure 2.1). According to this model, if the recombineering substrate consists of 
dsDNA, the λ-Red Exo protein, through its exonuclease activity, converts it
into ssDNA [16]. This model suggests that although some dsDNA is being com- 
pletely degraded by Exo molecules that digest the recombineering substrate from 
both sides, in some cases one strand is digested completely before the other side 
is attacked by another Exo molecule, rendering the resulting ssDNA immune 
from further Exo digestion. If the recombineering substrate is ssDNA, no action 
by Exo is required. In both cases, the (resulting) ssDNA strand is protected from 
further degradation by endogenous nucleases via Beta, which binds to the ssDNA 
and escorts it to single-stranded areas in the chromosome [17, 18]. Single-stranded
regions occur during DNA repair, transcription-induced supercoiling,
and, importantly, all along the genome during chromosome replication. Studies
show that the ssDNA substrate anneals to the chromosome in a strand-biased
manner, which correlates with the direction of DNA replication [19, 20]. These
results suggest that the ssDNA annealing is directed to the lagging strand of the
replication fork via its homology regions, where it essentially acts as an exogenous
DNA-based Okazaki fragment [16, 21]. Overall, the recombineering
process results in two daughter cells, one of which harbors the desired genetic
modification, while the other remains genetically identical to its parental ancestor,
limiting this method to a theoretical maximum efficiency of 50% [15].


Figure 2.1 The λ-Red system and the replication fork annealing model of recombination.
Either a double- or single-stranded recombineering substrate, consisting of the DNA
sequence to be inserted flanked by homology arms, is transformed into cells. The λ-Red
proteins facilitate recombination by digesting one strand of DNA in the case of dsDNA (Exo),
by inhibiting RecBCD nuclease activity (Gam), and by protecting and conveying the ssDNA to
the replication fork (Beta). Then, the ssDNA acts as a mismatched Okazaki fragment and binds
to the lagging strand via its homology arms. This process results, upon completion of cell
duplication, in one wild-type daughter cell and one recombineered, heterozygous-like
daughter cell. Reprinted with permission from Pines et al. 2015 [15]. Copyright 2015 American
Chemical Society.


2.3 Trackable Multiplex Recombineering 


The E. coli genome consists of over 4000 genes. When engineering the E. coli 
genome for a desired trait (e.g., tolerance to a growth condition or increased 
production of a valuable chemical), combinations of multiple genetic modifica- 
tions are often required to achieve optimal performance. The result is a combi- 
natorial mutation space that expands exponentially with the number of targeted 
genes and quickly exceeds the size of space that can be searched on laboratory 
time scales. For example, if each of the 4000 genes can be set to either an “off”
or an “on” state, there are 2⁴⁰⁰⁰ possible states. TRMR and T²RMR provide a
rapid and efficient way to modify an entire genome in a controlled manner and
to evaluate the effects of those genetic modifications simultaneously. Using these
techniques it is possible to modify >95% of the genes in E. coli in a single day. An
overview of the TRMR and T²RMR techniques is shown in Figure 2.2. In order
to engineer a genome using TRMR or T²RMR, a synDNA cassette is created
that encodes for a genetic feature (such as the overexpression or underexpression 
of each specific gene) and a molecular barcode that is used to track each feature. 
These synDNA cassettes are then introduced in parallel into cell populations via 
recombineering. Next, the modified populations are grown in any desired growth 
condition or in selective medium. Microarray or sequencing analysis of the 
molecular barcodes is used to determine the relative fitness of each allele/ 
engineered cell in the surviving population under the chosen conditions. TRMR
and T²RMR libraries can only be used to evaluate phenotypes that can be either
selected or screened for.
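
To give a sense of scale for the combinatorial space described above, the following minimal sketch counts genotypes for a genome of 4000 genes. The function names are illustrative and do not come from the TRMR literature. The second function assumes that exactly k genes are modified, each to one of two engineered states, while all other genes stay wild type.

```python
import math

N_GENES = 4000  # approximate gene count used in the text


def total_states(n_genes: int, states_per_gene: int = 2) -> int:
    """Number of genotypes when every gene takes one of `states_per_gene` states."""
    return states_per_gene ** n_genes


def k_gene_combinations(n_genes: int, k: int, states_per_gene: int = 2) -> int:
    """Genotypes with exactly k modified genes, each in one of the engineered states."""
    return math.comb(n_genes, k) * states_per_gene ** k


if __name__ == "__main__":
    # The full space is far too large to print usefully, so report its digit count.
    print("decimal digits in 2^4000:", len(str(total_states(N_GENES))))
    for k in (1, 2, 3):
        print(k, "modified gene(s):", k_gene_combinations(N_GENES, k))
```

Even when only two genes are modified at a time, the count is already in the tens of millions, which illustrates why a trackable single-gene survey is a practical first step before combinatorial work.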

To date, TRMR and T²RMR have been used to map genes required for growth
in various types of media and to optimize tolerance to acetate, low pH, cellulosic 
hydrolysate, isobutanol, ethanol, isopentenol, furfural, and various antibiotics 
[22-26]. These studies have given insight into carbon source and vitamin utiliza- 
tion, primary and secondary metabolism, and mechanisms of toxicity under a 
variety of conditions. 

While the next few paragraphs will provide basic information on the TRMR 
and T²RMR methods, readers are referred to an in-depth protocol for complete
experimental details of TRMR [27]. 


2.3.1 TRMR and T²RMR Library Design and Construction


All TRMR and T²RMR libraries consist of two main parts: (i) “targeting” oligos
that contain homology to each gene in the E. coli genome, a molecular barcode 
to identify each oligo uniquely, and sequences used to amplify each region of the 
oligo by PCR, and (ii) “shared DNA” that encodes for a genetic function and an 
antibiotic resistance marker. These two parts are then ligated to each other to
create synDNA cassettes that can then be amplified, linearized, and transformed
into cells (Figure 2.3). During amplification, circular concatemers are created.
These concatemers are then cleaved by restriction enzyme digest to generate
linear dsDNA with the homology regions, molecular barcodes, antibiotic resistance,
and gene modification sequences in the correct order. The synDNA cassettes
are linearized because linear DNA, and not circular DNA, is generally
used as a substrate for recombineering using the λ-Red system [6].


Figure 2.2 Overview of trackable multiplex recombineering (a) and tunable trackable
multiplex recombineering (b). TRMR and T²RMR cassettes are designed and synthesized in
multiplex followed by transformation into Escherichia coli. The E. coli population is then placed
under selective pressure. Alleles that are enriched during selection are identified by
microarray (TRMR) or next-generation sequencing (T²RMR), and their relative fitness is
determined. Cassette design for each technique is shown at the top. Black regions are shared
DNA and gray regions are from the targeting oligos. HA1 and HA2, homology regions; P,
barcode priming site; G, barcode identifying the gene; B, barcode identifying the BCD; BlastR,
blasticidin resistance gene; KanR, kanamycin resistance gene; stop, three frame stop codons;
Ts, terminator spacer; Tp, terminator pause; Pi, promoter insulator; LacIO, LacI-regulated
synthetic inducible promoter (apFAB906); BCD, bicistronic design (dual RBS). Adapted with
permission from Freed et al. 2015 [22]. Copyright 2015 American Chemical Society.


Figure 2.3 Construction and incorporation of TRMR (a) and T²RMR (b) cassettes. In both TRMR and T²RMR, targeting oligos
are ligated with a shared DNA cassette encoding a specific genetic function. The ligated synDNA cassettes are amplified by
rolling circle amplification and then cleaved to create a linear dsDNA substrate. (c) The linear synDNA cassettes are
recombineered into cells, targeting all genes at one time. Adapted from Warner et al. 2010 [26].


In both TRMR and T²RMR, targeting oligos were designed using homology
regions that result in the synDNA cassette being inserted upstream of each gene 
and replacing each gene’s start codon in E. coli MG1655. In the original demon- 
stration of TRMR [26], all protein-coding genes were targeted. Each targeting 
oligo also contained a unique 20-nucleotide sequence that served as a molecular 
barcode (or “tag”) used to track each gene. The barcodes were chosen from a set 
that had previously been used successfully in yeast [8]. In T²RMR [22], pseudogenes
and noncoding RNAs were targeted in addition to all protein-coding
genes. Each targeting oligo contained a 12-nucleotide sequence that had been 
optimized to serve as a molecular barcode for high-throughput sequencing. 
Targeting oligos for both TRMR and T²RMR were synthesized on a microchip by
Agilent.

The most significant differences between TRMR and T²RMR are in the design
of the shared DNA. TRMR consists of two libraries: an “up” library that causes 
genes to be overexpressed and a “down” library that causes genes to be underex- 
pressed. The shared DNA for the up cassette contains the strong PLtetO-1 pro- 
moter and a strong ribosome binding site (RBS), which generally results in 
increased transcription and translation of downstream genes. The shared DNA 
for the down cassette contains no promoter and no RBS, resulting in the deletion 
of the native RBS and subsequent decrease in translation initiation. The activity 
of the β-galactosidase protein (lacZ gene) was used to confirm that the up construct
for this protein resulted in overexpression and the down construct resulted
in loss of expression of the protein (Figure 2.4a). 

While these libraries have been successful in identifying alleles responsible for 
a desired phenotype, there are some limitations to the original TRMR libraries. 
One drawback is that these libraries do not use standardized synthetic parts, 
which may result in inconsistent expression levels across targeted genes, since it 
is known that placing the same promoter or RBS in front of two different genes 
can cause the two genes to be expressed at vastly different levels [22-25]. Another
limitation is that the original TRMR libraries are binary in nature and do not
allow the tuning of expression levels. When engineering a synthetic pathway, the
activity of the pathway can be dependent on expressing each protein in a very
narrow expression range [28] rather than in a binary way.


[Figure 2.4 image. Panel (a): wild-type MG1655, lacZ up, and lacZ down on IPTG + X-gal and glucose + X-gal. Panel (b): log-scale activity versus IPTG concentration for the off, low, intermediate, and high libraries.]

Figure 2.4 Validation of TRMR (a) and T²RMR (b) cassettes using the lacZ gene. In TRMR, the
“up” cassette causes lacZ to be expressed, while the “down” cassette results in a loss of
expression. In T²RMR, varying the four libraries (“off,” “low,” “intermediate,” and “high”) and the
amount of inducer (IPTG) allows lacZ to be expressed over a ~10⁴-fold range. Adapted with
permission from Warner et al. 2010 [26] and Freed et al. 2015 [22]. Copyright 2015 American
Chemical Society.

The design of the shared DNA for T²RMR addresses these limitations.
T²RMR uses a “bicistronic design” (BCD), devised by Mutalik et al. [29], that
embeds two RBSs in a nucleotide sequence that encodes a 16-amino-acid pep- 
tide, which is then placed directly in front of a gene of interest. The dual RBSs 
embedded in this leader peptide affect ribosome binding and translation initia- 
tion, and therefore these BCD constructs give much more consistent expres- 
sion when tested with a variety of genes. Mutalik et al. further tested several 
hundred promoter variants, as well as combinations of BCDs and promoters, to 
express genes over a wide dynamic range. In 93% of cases, they found they could 
predict the expression level of a protein to within twofold, which is a great 
improvement over using a single RBS [29]. T²RMR consists of four libraries/
shared DNAs, each carrying a different BCD, to give four base expression
levels: “off,” “low,” “intermediate,” and “high.” Each library contains a 12-nucleotide
barcode to identify the BCD during high-throughput sequencing. An
inducible LacI-regulated promoter was placed in front of each BCD, allowing
for fine-tuning of gene expression to almost any desired level just by
changing the amount of inducer (isopropyl β-D-1-thiogalactopyranoside
(IPTG)) that is added to the medium. Validation of T²RMR with the
β-galactosidase protein (lacZ gene) gave expression over a ~10⁴-fold activity
range, as measured by the Miller assay, by using different combinations of the
libraries and amounts of IPTG (Figure 2.4b).


2.3.2 Experimental Procedure


The experimental procedure is the same for both TRMR and T²RMR. Once the
double-stranded, linearized synDNA libraries have been constructed, they are
incorporated into the genome by homologous recombination. E. coli cells containing
the λ-Red recombination proteins (either integrated directly on the
chromosome or expressed from a plasmid such as pSIM5 [14]) are grown at
30°C to mid-log phase in medium containing the appropriate antibiotic if
required. If using pSIM5, expression of the λ-Red enzymes is induced by incubating
cells at 42°C for 15 min. The cells are then chilled on ice and made
electrocompetent by washing with ice-cold water as previously described [11].
SynDNA is then transformed into cells by electroporation. After 2 h of recovery
at 37°C, cells are spread on plates containing the antibiotic
that is selective for library clones. Plates are incubated at 37°C for 22 h and then
colonies are scraped from the plates, resuspended in LB medium, and aliquoted
for storage at -70°C.

Screening and selection of TRMR and T²RMR clones can be carried out using
either liquid or solid media with any chemical compound that modifies growth 
or confers a phenotype that can be detected by a high-throughput assay. An 
equal number of cells from all libraries (either up and down or off, low, intermediate,
and high) are mixed together in medium containing the antibiotic that is
selective for the libraries and are grown to late log phase. An aliquot of this
culture is frozen for further analysis, with the remainder of the culture being used
for selections. For selections in liquid medium, cells from the initial culture are 
inoculated into medium containing the desired chemical compound and are 
grown to stationary phase. Cells are then harvested for analysis as discussed in 
the following section. For selections on solid medium, cells from the initial cul- 
ture are spread on plates containing the desired chemical compound, and plates 
are incubated until colonies are visible. All colonies are scraped from the plates 
for further analysis. 


2.3.3 Analysis of Results 


Either microarray or sequencing analysis can be used to determine the relative 
fitness of each allele after selection. Genomic DNA is extracted from cells before 
and after selection (and at various points during selection if desired), and the 
molecular barcodes from each sample are amplified by PCR. In the original set of 
TRMR experiments, barcodes were hybridized to the GenFlex Tag4 array from 
Affymetrix [30]. A separate microarray experiment was done for each library. 
Ten barcodes were spiked into the genomic DNA mixture in known amounts to 
determine barcode concentrations, and 1642 negative-control barcodes were 
also included to determine background hybridization rates. Allele frequencies 
for each gene were then determined by dividing allele concentrations by the total 
concentration of all alleles detected on the array. 

In T²RMR, molecular barcodes that are optimized for high-throughput
sequencing were used instead of microarrays to track alleles. Each allele had two
barcodes: one identifying the library (off, low, intermediate, high) and one
identifying the gene. All samples were combined into a single MiSeq run.
High-throughput sequencing allows more quantitative analysis of genotype frequencies,
since individual alleles are directly tracked at the nucleotide level rather than by 
relative hybridization intensity (measured in arbitrary fluorescence units). A sin- 
gle run of Illumina MiSeq can generate 10°-10’ sequencing reads (a typical 
microarray signal distribution ranges over about 10°), allowing for each barcode 
to be sequenced thousands of times. This deep sequencing additionally aids in 
the detection of rare alleles [31], which might be present in too low a concentra- 
tion to be identified by microarray hybridization. A microarray hybridization 
signal can also saturate [30], resulting in loss of data for the most highly expressed 
alleles. A second advantage to high-throughput sequencing is that it results in a 
lower error rate in identifying alleles. Although hybridization to a microarray, in 
general, gives high fidelity, some alleles will fail to hybridize to the sequence on 
the microarray that is perfectly complementary [32]. Furthermore, errors in bar- 
code sequences can be introduced during cell replication, DNA synthesis, or 
PCR amplification, resulting in loss of hybridization or, worse, in hybridization 
to an incorrect spot on the microarray [32]. With high-throughput sequencing, 
on the other hand, errors in barcode sequences can be identified and corrected 
or discarded [33], as was done with T²RMR. In T²RMR, allele frequencies for
each gene were determined by dividing the number of barcode counts for that 
gene by the total number of barcode counts for all genes. 
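
As a minimal sketch of the counting step described above, the code below assigns sequenced barcodes to known alleles, rescues reads that are exactly one mismatch away from a single known barcode, discards everything else, and normalizes the counts to frequencies. The one-mismatch rule, the function names, and the toy barcodes in the usage note are assumptions made for illustration; the published T²RMR analysis may handle errors differently.

```python
from collections import Counter
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read: str, known: set[str]) -> Optional[str]:
    """Return the exact match, a unique one-mismatch rescue, or None (discard)."""
    if read in known:
        return read
    near = [bc for bc in known if len(bc) == len(read) and hamming(bc, read) == 1]
    # Ambiguous reads (several candidates) and distant reads are discarded.
    return near[0] if len(near) == 1 else None


def allele_frequencies(reads: list[str], known: set[str]) -> dict[str, float]:
    """Barcode count for each allele divided by the total assigned barcode count."""
    counts: Counter = Counter()
    for read in reads:
        barcode = assign_barcode(read, known)
        if barcode is not None:
            counts[barcode] += 1
    total = sum(counts.values())
    return {bc: n / total for bc, n in counts.items()} if total else {}
```

For example, with the hypothetical barcodes `{"AAAA", "CCCC", "GGTT"}`, the read `AAAT` is rescued as `AAAA`, while `TTTT` is discarded because it is more than one mismatch from every known barcode.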


Figure 2.5 T²RMR has significantly increased ability to discriminate between MOPS minimal
medium and LB rich medium. The Pearson dissimilarity (0 indicates perfectly linearly
correlated, and 2 indicates perfectly negatively correlated) between MOPS and LB samples for each
library type is shown. * indicates p < 0.05, Benjamini-Hochberg corrected significance.
Adapted with permission from Freed et al. 2015 [22]. Copyright 2015 American Chemical
Society.
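
The Pearson dissimilarity used in Figure 2.5 can be sketched as one minus the Pearson correlation coefficient, which gives the 0 to 2 range stated in the caption. The function name and the toy vectors in the usage note are illustrative assumptions; the inputs would in practice be per-allele frequency or fitness profiles from two conditions.

```python
import math


def pearson_dissimilarity(x: list[float], y: list[float]) -> float:
    """Return 1 - r, where r is the Pearson correlation of x and y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - cov / (norm_x * norm_y)
```

Two profiles that rise and fall together, such as `[1, 2, 3]` and `[2, 4, 6]`, give a value near 0, whereas `[1, 2, 3]` and `[6, 4, 2]` give a value near 2.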


In both TRMR and T²RMR, relative fitness was calculated by determining the
ratio of the final allele frequency after selection to the initial allele frequency. Any
allele that increases in frequency after selection is likely to confer tolerance to the
selective condition. While both TRMR and T²RMR are successful at identifying
alleles responsible for a desired phenotype, T²RMR may be able to identify alleles
that improve fitness under weak selective pressure that the original TRMR would
not be able to identify. For example, T²RMR does significantly better than TRMR
at discriminating between LB and MOPS growth media (Figure 2.5). In all cases,
to confirm a fitness advantage, it is advisable to analyze the growth of cells con- 
taining each individual allele that is enriched during selection and compare it 
with wild-type cells. 
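
The relative fitness calculation described above, the ratio of final to initial allele frequency, can be sketched as follows. The function names, the enrichment threshold of 1.0, and the allele labels in the usage note are hypothetical. An allele flagged here is only a candidate, in line with the advice above to confirm each one individually against wild-type cells.

```python
def relative_fitness(initial: dict[str, float], final: dict[str, float]) -> dict[str, float]:
    """Ratio of final to initial frequency for every allele present before selection."""
    return {
        allele: final.get(allele, 0.0) / f0
        for allele, f0 in initial.items()
        if f0 > 0  # alleles absent from the starting pool have no defined ratio
    }


def enriched(fitness: dict[str, float], threshold: float = 1.0) -> list[str]:
    """Alleles whose frequency rose during selection, most enriched first."""
    hits = [allele for allele, w in fitness.items() if w > threshold]
    return sorted(hits, key=lambda allele: fitness[allele], reverse=True)
```

With initial frequencies `{"geneA_up": 0.25, "geneB_up": 0.5, "geneC_down": 0.25}` and final frequencies `{"geneA_up": 0.5, "geneB_up": 0.25, "geneC_down": 0.25}`, only `geneA_up` has a ratio above 1 and is reported as enriched.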


2.4 Current Challenges


TRMR and T²RMR are novel and powerful techniques that allow for the modification
and tracking of thousands of genes in a single step. However, there are
some challenges that need to be considered when performing TRMR or T²RMR.


2.4.1 TRMR and T²RMR Are Currently Not Recursive


One challenge for the current TRMR and T²RMR designs is that only a single
round of recombineering is currently implemented. This limitation arises because
the TRMR and T²RMR substrates are dsDNA, and because of the
current low efficiency of dsDNA recombineering, an antibiotic
selection step is essential to ensure the removal of non-recombineered cells.
While multiple different antibiotic markers may be used for successive rounds
of recombination with a TRMR or T²RMR library, the limited number of markers
restricts the number of additional cycles that can be performed. Another
option is to remove the resistance gene (via flanking FRT sites) and reintroduce
the same library in the next recombineering round, but this will greatly
extend the time required for every cycle. Alternatively, a new technology called
CRISPR-enabled trackable genome engineering (CREATE) has been developed
that allows clustered regularly interspaced short palindromic repeat (CRISPR)-
based markerless selection of recombineered cells [34]; this technique could be
combined with the expression-level cassettes from TRMR or T²RMR as discussed
in Section 2.5.2.

An additional concern is the rapid increase in the number of recombinants 
that result from every TRMR or T²RMR cycle. In the ideal case, every possible
combination of mutations should be represented in the cell population, which 
would require a volume of cells that exceeds the capability of current equip- 
ment. Lastly, barcode identification is being performed at the whole population 
level and thus does not distinguish whether the barcode originated from a 
single or multiple cells. This limitation could be overcome by using new single- 
cell sequencing technologies including (i) single-cell linkage PCR, which allows 
for the sequencing of millions of barcoded individual cells [35] and (ii) tracking 
combinatorial engineered libraries (TRACE), which gives the ability to track 
combinations of mutations from a single cell [36, 37]. 


2.4.2 Need for More Predictable Models 


Mathematical modeling of a metabolic pathway can be a valuable tool for further 
optimization of that pathway [reviewed in [38]]. Once the metabolic flux through 
a pathway is accurately modeled, bottlenecks in that pathway can be identified, 
and further engineering efforts can be directed toward removing that bottleneck. 
TRMR and T²RMR can aid in the development of models by identifying genes
that are involved in a pathway and that would have been difficult to predict 
a priori [22, 26]. Once these genes have been identified, new TRMR-like libraries 
that are predicted to be enriched for better performing strains can be designed. 

Although metabolic models can be useful, unfortunately they often lack pre- 
dictive power [reviewed in [39]]. This can be due to a number of factors includ- 
ing lack of mechanistic detail about the pathway, inconsistent behavior of 
synDNA parts, or failure to account for epistatic interactions [25]. Epistatic 
interactions can make both TRMR and T²RMR data particularly difficult to
model. The development of more predictive models is an active and ongoing part 
of metabolic engineering research. 


2.5 Complementing Technologies 


2.5.1 MAGE 


The TRMR and T²RMR approach of targeting all genes simultaneously in a
trackable manner may prove beneficial for selecting candidate genes for other
downstream techniques in the pursuit of improved production of chemicals
and for strain development. The advantage TRMR and T²RMR provide is the
potential discovery of genes whose involvement in any specific pathway is currently
impossible to predict using computational or other methods. For example,
gene candidates can be derived from a tolerance experiment, as mentioned
earlier. However, tolerance may be increased even further by combining several
such mutations via multiplex automated genome engineering (MAGE) [40] or
by using another combinatorial, recursive multiplex recombineering technique.
Not only do such combinations dramatically increase the mutational
space, but combinatorial experiments also must consider epistatic effects
among the combined mutations, which are extremely difficult to predict a priori
[25]. Additionally, the question of which candidate genes should be included
in the second-step MAGE experiment is far from trivial. Intuitively, one might
pick the top performing genes for combinatorial experiments. However, this
approach, termed the “greedy approach,” might result in reaching a local
maximum in the potential fitness landscape rather than the desired global
maximum. Computational efforts are under way to tackle these
challenges.
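
The risk of the greedy approach can be illustrated with a toy fitness landscape. In the sketch below, all gene labels, single-mutant gains, and epistasis terms are invented for illustration; fitness is modeled, as an assumption, as additive single effects plus pairwise interaction terms. The greedy choice combines the best single mutants, while the exhaustive choice scores every combination of the same size.

```python
from itertools import combinations

# Hypothetical single-mutant fitness gains (wild type = 0).
SINGLE = {"A": 0.30, "B": 0.25, "C": 0.10, "D": 0.08}
# Hypothetical pairwise epistasis: negative is antagonistic, positive is synergistic.
EPISTASIS = {frozenset("AB"): -0.40, frozenset("CD"): 0.45}


def fitness(genotype: frozenset) -> float:
    """Additive single effects plus pairwise epistasis for a set of mutations."""
    total = sum(SINGLE[m] for m in genotype)
    for pair in combinations(sorted(genotype), 2):
        total += EPISTASIS.get(frozenset(pair), 0.0)
    return total


def greedy_pick(k: int) -> frozenset:
    """Combine the k best single mutants, ignoring interactions."""
    ranked = sorted(SINGLE, key=SINGLE.get, reverse=True)
    return frozenset(ranked[:k])


def exhaustive_pick(k: int) -> frozenset:
    """Score every k-mutation combination and return the fittest one."""
    return max((frozenset(c) for c in combinations(SINGLE, k)), key=fitness)
```

In this toy landscape, `greedy_pick(2)` returns the pair A and B, whose antagonistic interaction lowers its fitness, while `exhaustive_pick(2)` finds the synergistic pair C and D. Exhaustive scoring is only feasible for small candidate sets, which is why the choice of candidates matters.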


2.5.2 CREATE


To date, TRMR and T²RMR have only been used for modifying the expression
level of genes. However, work done on the engineering of biocatalysts has shown 
that in some cases a single point mutation can alter the catalytic activity of an 
enzyme (reviewed in [41]), suggesting that big advances can come from subtle 
changes. Barcoded editing at the single nucleotide polymorphism (SNP) level 
will therefore lead to even faster improvements in strain engineering and path- 
way optimization. 

Recent technologies were designed to address this need for higher-resolution 
genome editing, namely, creating point mutations within an open reading frame. 
The first generation of these ideas took advantage of the newly discovered 
CRISPR/Cas9 systems to increase editing efficiency and introduce single edits 
[42-44]. Multiplex editing of numerous sites quickly followed [34]. Similar to 
TRMR and T²RMR, CREATE utilizes the λ-Red recombineering system and
array-based DNA synthesis to create rationally designed edits. Here, however, 
the CRISPR/Cas9 system is used for increasing editing efficiency and for the 
removal of non-edited genomes. CRISPR/Cas9-based editing technologies take 
advantage of the RNA-guided endonuclease activity of the Cas9 protein [45]. 
This activity depends on a dinucleotide GG protospacer adjacent motif (PAM)
next to the target site and leads to a site-specific double-strand break in the
cell’s genomic DNA and subsequent cell death (in cells that lack an efficient
double-strand break repair mechanism, such as E. coli). However, cells containing
a mutation in the PAM site are protected from DNA cleavage [42].

Each CREATE cassette includes the target site in a gene and a proximal PAM 
sequence that are selected for mutagenesis. Since in most cases the PAM falls 
within the open reading frame, it is silently mutated, so the amino acid sequence 
is not altered. To allow multiplex editing, the PAM-specific corresponding gRNA 
coding sequence is co-synthesized with the target site editing oligo and cloned 
into an editing vector, which also serves as a target-specific barcode. This design 
enables the creation of barcoded libraries composed of tens of thousands of cells, 
with each genome having a single amino acid edit. Hence, TRMR or T²RMR and
CREATE can be used in conjunction, with TRMR or T²RMR identifying important
genes under specific conditions and CREATE allowing for the engineering
of those genes for optimal results. These two technologies could additionally be 
combined in the future (this would require new technology allowing for an 
increase in the length of targeting oligos that can be synthesized on a microchip). 
Adding the main T²RMR elements to the CREATE cassette design can allow
higher versatility in editing. Gene expression tuning can be coupled to gene edit- 
ing, enabling the investigation of expression in conjunction with point muta- 
tions. The ability to cycle these edits for the generation of multiple diverse 
genotypes will help researchers to isolate desired complex traits that combine 
both protein sequence and expression level. 
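
The silent PAM mutation described above can be sketched in code. The example below is a simplified illustration and not the CREATE design software: it scans only the given coding strand for NGG motifs, uses the standard genetic code, and lists single-codon synonymous substitutions that remove the GG dinucleotide. All function names and the short test sequence in the usage note are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence whose length is a multiple of 3."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def find_pams(cds: str) -> list[int]:
    """0-based start positions of NGG motifs on the given strand only."""
    return [i for i in range(len(cds) - 2) if cds[i + 1:i + 3] == "GG"]


def silent_pam_edits(cds: str, pam_start: int) -> list[str]:
    """Sequences with one synonymous codon swap that removes the GG of the PAM."""
    edits = []
    gg_positions = (pam_start + 1, pam_start + 2)
    # A GG dinucleotide can sit inside one codon or span two adjacent codons.
    for codon_start in sorted({p - p % 3 for p in gg_positions}):
        codon = cds[codon_start:codon_start + 3]
        for alt, aa in CODON_TABLE.items():
            if alt == codon or aa != CODON_TABLE[codon]:
                continue  # keep only synonymous alternatives
            edited = cds[:codon_start] + alt + cds[codon_start + 3:]
            if edited[pam_start + 1:pam_start + 3] != "GG":
                edits.append(edited)
    return edits
```

For the toy sequence `ATGCTGGCTTAA` (Met-Leu-Ala-stop), the only NGG motif starts at position 4, and the GG can be removed by swapping the leucine codon CTG for a synonymous codon that does not end in G. A real design would also need to consider PAMs on the opposite strand and other constraints not modeled here.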


2.6 Conclusions 


Recent advances in DNA synthesis and the development of standardized genetic 
parts have greatly increased genome engineering capabilities. TRMR and T²RMR
allow a single researcher to modify an entire genome in a single day and map 
which alleles are responsible for a desired phenotype. This ability to fine-tune 
expression levels, particularly when combined with other technologies for mak- 
ing point mutations or combinatorial mutations, will allow researchers to quickly 
and easily engineer strains for maximal production of, or tolerance to, any 
compound. 


Definitions 


Recombineering: Genetic engineering using homologous recombination

TRMR: Trackable multiplex recombineering

Recombineering substrate: A single- or double-stranded piece of DNA that is
to be inserted into the target’s genome via recombineering

λ-Red system: Using λ-Red phage proteins to enable homologous recombination
in bacteria

Molecular barcode: Unique, short nucleotide sequence used to identify and
track a specific gene or piece of DNA

Synthetic DNA: Artificial DNA that is synthesized; does not have to be based on
naturally occurring DNA sequence


The ability to precisely manipulate genomic DNA in living cells in a site-specific
manner has revolutionized biomedical research. Site-specific genomic modifica- 
tion has greatly advanced preclinical research by creating invaluable cellular and 
animal models of disease and is currently in clinical trials for therapeutic applica- 
tion. Zinc finger proteins (ZFPs) represent a class of proteins that can be engi- 
neered to manipulate user-defined chromosomal DNA targets with a high degree 
of specificity. Zinc finger nucleases (ZFNs) cause double-stranded breaks (DSBs) 
at precise genomic locations that can induce deletions, insertions, transloca- 
tions, and/or point mutations in the genomic DNA via endogenous DNA repair 
mechanisms. ZFPs fused to recombinases or transposases act in an autonomous 
manner without the need to induce toxic DSBs. This chapter represents an over- 
view of ZFPs, the various methods available to researchers for engineering them, 
options for genomic modifications, methods for validation of genomic modifica- 
tions, an overview of options for delivery to cells, and some novel ways that zinc 
fingers (ZFs) are being used for genomic alteration. 


3.1 Introduction to Zinc Finger DNA-Binding
Domains and Cellular Repair Mechanisms 


3.1.1 Zinc Finger Proteins 


The Cys₂-His₂ ZF domain makes up the most common DNA-binding domain
structure in eukaryotes [1]. Structural determination of ZF domains bound to 
DNA has enabled rational design of proteins to bind targeted DNA sequences 
[1]. Such engineered ZFP domains can be fused to other protein domains with 
differing capabilities to create enzymes capable of targeted cleavage of DNA and 
other targeted genomic effects [2]. ZFPs can be engineered with a high degree of 
specificity for unique genomic elements [3]. This chapter mainly focuses on the
use of engineered ZFPs called zinc finger nucleases for site-directed genomic
modification through targeted DNA cleavage. When a targeted double-stranded
DNA break is engineered, endogenous repair ensues via either homologous
recombination (HR) or nonhomologous end joining (NHEJ) [3] (Figure 3.1).

Figure 3.1 Targeted genomic modification using zinc finger nucleases (ZFNs). A pair of ZFPs
fused to the FokI nuclease domain is designed to target opposite strands of DNA. When
dimerization of the FokI domain occurs following ZF binding, a double-strand break (DSB) is
created in the DNA (shown by lightning bolt). The cell chooses to use either the
nonhomologous end joining (NHEJ) or homologous recombination (HR) pathway to repair the
DSB. NHEJ can be used for gene disruption via targeted mutagenesis using one pair of ZFNs or
deletions/inversions with two pairs of ZFNs. If gene correction via HR is desired, a homologous
template sequence is provided that will be used by the cellular HR machinery to replace the
endogenous sequence near the DSB. Alternatively, targeted gene addition at or near the site
of the targeted DNA cleavage can be achieved by flanking the sequence to be inserted with
homologous arms.


3.1.2 Homologous Recombination 


HR is a process of exchanging shared DNA sequences between sister chromatids. 
HR naturally occurs in a diverse range of organisms, from bacteria to humans. 
The HR process has two major purposes: (i) the protection of somatic genomes 
through DSB repair to prevent mutations that could result in cell death or cancer 
and (ii) to increase the genetic diversity of the next generation through recombi- 
nation of the parental chromosomes in each gamete during meiosis. HR has been 
proven to be very useful for the genomic manipulation of yeast, where rates of
HR are naturally high [4]. In contrast, mammalian cells provided with a homolo- 
gous DNA template have extremely low background rates of HR: only one in a 
million somatic cells shows evidence of HR-mediated repair following homolo- 
gous DNA introduction [5, 6]. To stimulate repair, DSBs may be introduced to 
encourage the cell to repair the DNA using the abundantly available homologous 
DNA template [7]. ZFN-induced DSBs stimulate HR at target sequences. In this 
way ZFNs may be used to introduce or correct point mutations in a seamless 
manner without additional sequences, making this strategy preferred to gene 
addition strategies when possible. Unfortunately, the frequency of HR is cell-type 
dependent and cell division is required, ruling out many potential applications 
[8, 9]. Despite the limitations, this method has been applied for genomic manip- 
ulation of many species across the kingdoms of life, including bacteria, yeast, 
plants, and mice, as well as in human cells for seamless gene correction [10]. 


3.1.3 Non-homologous End Joining 


NHEJ is a system of DSB repair that acts by directly ligating the ends of linear
DNA. If the break is staggered and homologous sequences exist on either side, an
accurate repair can be made by sensing the homology [11]. However, if there are
no homologous strands, NHEJ can still mediate repair, stitching the DNA back
together via the protein Ku70 [10, 12]. In this case, there is usually the gain or loss of
a few base pairs of DNA resulting from the chemical repair of the free ends of the 
DNA. Under highly stressful conditions such as ultraviolet radiation, toxins, 
radioactivity, or desiccation, the cell could suffer multiple DSBs. In this case, 
when there are more than two free linear ends, the cell may ligate the incorrect 
ends together, resulting in chromosomal rearrangements or large deletions. Such 
major insults can result in a cancer-causing phenotype or more likely cell death. 
Even with this possibility, NHEJ is still highly conserved due to the huge advantage
to the organism of having a DNA repair mechanism that does not rely on the
presence of homologous sequences on the sister chromatid. 

Gene deletion or addition can be achieved with NHEJ, while the elegant seamless
gene correction or addition strategies require HR. The majority of the cell
types comprising an adult human, which are the cells that are most desirable to
be targeted for correction in a gene therapy setting, prefer NHEJ over HR. In
transfected cells, using I-SceI to induce a DSB and the sister chromatid for HR
rarely results in gene correction, making NHEJ the easier goal to achieve.
Demonstrating this point, a clinical trial has been completed using a ZFN to 
knock out the CCR5 receptor to block HIV infection [13, 14], while there are no 
clinical trials underway that rely upon HR-mediated gene correction. 

Either of these repair strategies can be exploited using ZFNs for targeted 
genomic modification. NHEJ can be used for targeted mutagenesis of chromosomal
elements [15]. HR can be used for targeted DNA repair or gene addition
by providing a template strand of DNA homologous to the targeted site of DNA 
cleavage [16-18]. Therefore, targeted DNA cleavage can be used to achieve tar- 
geted mutagenesis, targeted DNA repair, or targeted DNA addition at specific 
genomic sites. 
3.2 Approaches for Engineering or Acquiring Zinc
Finger Proteins 


The most common approach for targeted genomic DNA cleavage via ZFPs is to 
use ZFNs [19-21]. In its simplest form, a ZFN is a fusion of a ZFP to a nuclease
domain via a short flexible linker sequence. The simplest ZFN combines a naturally
occurring ZFP with the linker and FokI nuclease domains to target its native
binding site [17]. However, ZFNs can also be rationally designed to target a wide 
range of sequences for a greater number of applications. 

ZF motifs are 30-amino-acid protein domains that chelate a zinc ion. They 
bind to DNA by insertion of an alpha helix into the major groove of the DNA 
to probe the DNA sequence [22]. Naturally occurring ZFs may be mutated to 
alter binding specificity [23]. The DNA sequence that will be bound is defined 
by certain amino acids [24, 25]. These ZFs can be combined into strings of 3, 
4, 5, or 6 ZFs to bind increasingly long DNA sequences to enhance the speci- 
ficity of the interaction [26-28]. The availability of motifs that recognize tri- 
plet sequences is a limiting factor in ZF design, as the ZFN pair should be 
designed such that they will create a DSB as close to the site of desired HR as 
possible [29]. 
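
The gain in specificity from longer ZF arrays can be illustrated with a rough calculation. The sketch below is only an approximation: it assumes a random genome with independent, equally frequent bases, which real genomes are not, and the function name and the 3 x 10^9 bp genome size are illustrative assumptions rather than values taken from this chapter.

```python
def expected_random_matches(site_length_bp, genome_size_bp, both_strands=True):
    """Expected number of exact matches to one site in a random genome.

    Assumes independent, equiprobable bases (a deliberate simplification).
    """
    per_position = 0.25 ** site_length_bp
    strands = 2 if both_strands else 1
    return genome_size_bp * strands * per_position


if __name__ == "__main__":
    genome = 3_000_000_000  # assumed genome size, for illustration only
    for fingers in (3, 4, 5, 6):
        site = 3 * fingers  # each finger recognizes one DNA triplet
        print(fingers, site, expected_random_matches(site, genome))
```

Under these assumptions a 9-bp site (three fingers) is expected to occur many thousands of times by chance, whereas an 18-bp site (six fingers) is expected less than once, which is consistent with the text's point that longer arrays enhance specificity.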

Most restriction enzymes cleave palindromic sequences through coupled 
DNA-binding and cleavage events. The FokI endonuclease is different in that it
cuts DNA between two binding sites that can be 9-18 bp in length on opposite
DNA strands. FokI contains two separate domains: the N-terminal domain is
involved in sequence recognition, while the C-terminal domain contains a nuclease
[30, 31]. FokI is unique in that single amino acid substitutions resulted in the
decoupling of sequence recognition and cleavage [32, 33], allowing the nuclease 
domain to be isolated and fused to other DNA-binding domains. In addition, 
dimerization of the nuclease domain is required for DNA cleavage to occur [34]. 
The ZFN architecture has been improved such that cleavage by the enzyme 
requires a heterodimer to be formed, preventing the off-target events that could 
result from homodimer formation [35]. A recent study reported a multi-reporter 
selection system to identify ZFNs with high degrees of activity at the desired site 
and negligible activity at similar off-target sites in the genome [36]. Refinements 
through mutagenesis and DNA shuffling have made the FokI cleavage domain
15-fold more active and 6-fold more specific [37]. Therefore, ZFN pairs can be 
engineered such that DNA binding by each ZFN mediates FokI nuclease dimeri- 
zation between ZFP binding sites, resulting in targeted DNA cleavage [38, 39] 
(Figure 3.1). 
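
The paired-site geometry described above (two ZFP binding sites on opposite strands with FokI dimerizing between them) can be sketched as a simple sequence scan. This is a minimal illustration, not a design tool: the function names are hypothetical, the allowed spacer lengths are an assumption (the chapter does not state them), and the convention that the right-hand site is written 5' to 3' on the bottom strand is chosen here only for clarity.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_zfn_pair_sites(sequence, left_site, right_site, spacer_lengths=(5, 6, 7)):
    """Return (left_start, spacer_length) for every paired site in `sequence`.

    left_site  : target of the left ZFP, read 5'->3' on the top strand.
    right_site : target of the right ZFP, read 5'->3' on the bottom strand.
    spacer_lengths are assumed values for illustration only.
    """
    sequence = sequence.upper()
    left_site = left_site.upper()
    right_on_top = reverse_complement(right_site.upper())
    hits = []
    start = sequence.find(left_site)
    while start != -1:
        left_end = start + len(left_site)
        for spacer in spacer_lengths:
            # The right-hand site must begin exactly one spacer after the left site.
            if sequence.startswith(right_on_top, left_end + spacer):
                hits.append((start, spacer))
        start = sequence.find(left_site, start + 1)
    return hits
```

A scan of this kind only finds exact matches; it says nothing about binding affinity or off-target activity, which the selection systems discussed in this chapter are meant to address.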

A potential limitation when designing a ZFP is the lack of availability of
ZF motifs to recognize every triplet sequence [29]. In order to make longer
strings of 6 ZFs, the longer recognition site (18-19 bp) must have ZFs that can
recognize the entire sequence. There are several options available to investigators
for engineering ZFPs for this purpose, and these include modular assembly,
a selection method termed “OPEN,” context-dependent assembly termed
“CoDA,” and a proprietary system available from Sigma-Aldrich. These differing
approaches are discussed briefly below.
3.2.1 Modular Assembly


Modular assembly for engineering ZFPs utilizes a combination of validated ZF 
modules that each targets a separate DNA triplet. This combination allows for 
longer DNA sequences with higher probabilities of being unique in the genome 
to be targeted [22]. These modules can be combined by drawing from toolkits 
available from Barbas [40], ToolGen [41], and Sigma-Aldrich. Modules are then 
strung together for in silico predicted targeted binding of the desired DNA 
sequence [41, 42]. These end-effect ZFP sequences can be retrieved from a web 
server that utilizes known ZF binding to DNA triplets and designs the engi- 
neered ZFP in silico. The modules can be tied together using molecular biology 
techniques or gene synthesis. A potential drawback of this method is that the 
specificity of each ZF module can depend on both the context of the surrounding 
DNA target sequence and the other protein components that it is linked to [43]. 
Along these lines, modular assembly-produced four-finger ZFs outperform 
three-finger ZFs [41]. For these reasons, in silico prediction alone is not ideal for 
most applications. Modular assembly should be combined with a selection 
method to test many predicted ZFPs to find the one with the most desirable fea- 
tures for expression and binding to the desired target sequence, such as high 
activity and specificity. 
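
The in silico step of modular assembly can be sketched as a triplet lookup. The module table below is entirely hypothetical (placeholder identifiers, not entries from the Barbas, ToolGen, or Sigma-Aldrich toolkits), and the sketch deliberately ignores finger order, linker choice, and the context effects noted above, which is why a lookup like this is a starting point for selection rather than a substitute for it.

```python
# Hypothetical module archive: DNA triplet -> module identifier.
# These entries are placeholders for illustration only.
EXAMPLE_MODULES = {
    "GAA": "module_GAA",
    "GCC": "module_GCC",
    "GGT": "module_GGT",
    "GTG": "module_GTG",
}


def split_into_triplets(target):
    """Split a target site into consecutive 3-bp subsites."""
    if len(target) % 3 != 0:
        raise ValueError("target length must be a multiple of 3")
    return [target[i:i + 3] for i in range(0, len(target), 3)]


def assemble_modules(target, modules):
    """Return (modules found, triplets with no available module)."""
    triplets = split_into_triplets(target.upper())
    chosen = [modules[t] for t in triplets if t in modules]
    missing = [t for t in triplets if t not in modules]
    return chosen, missing
```

A non-empty list of missing triplets corresponds to the design limitation described in the text: the target cannot be covered with the modules at hand.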


3.2.2 OPEN and CoDA Selection Systems


Several selection methods have been devised to address the limitations of 
in silico modular assembly by relying on screening for optimal binding capabili- 
ties from large libraries of potential ZFPs. Initially, partially randomized ZF 
arrays were screened in large pools by phage display to select for those that 
could effectively bind to the desired DNA sequence [44, 45]. Pabo’s group 
devised a successful strategy to gradually extend the ZFP by adding and optimiz- 
ing each finger individually [46]. More recently, oligomerized pool engineering, 
or “OPEN” [47], derives ZFPs from randomized libraries. Each finger in a 
three-finger ZFP was randomized and the resulting library was screened using 
low-stringency selection methods [18]. The resultant clones were then picked to 
generate a pool of potential ZFPs that was further recombined by swapping the 
fingers [18]. These randomized, then recombined, three-finger ZFPs were 
selected for the optimal combination of fingers to bind to the desired target site 
[18]. While OPEN is available to all researchers, screening the large libraries that
are generated requires a substantial time investment and specialized knowledge of
the components involved. This has limited the adoption of OPEN. The latest
generation of ZFN assembly is termed context-dependent assembly (CoDA) [48], which
takes into account interactions between ZFPs while using modular assembly 
[19]. The CoDA approach can be used to create an array of viable ZFP options 
for many target sites with a similar efficiency to OPEN but is easier and faster to 
use [19]. CoDA relies upon arrays of previously validated three-finger ZFPs that 
share a common middle finger and are shuffled via this homologous sequence to 
create an array [19]. All of the software and reagents required to implement 
CoDA are publicly available. The Zinc Finger Consortium offers web-based 
tools for evaluating ZFN target sites within a genomic DNA region for both
OPEN and CoDA at the www.zincfingers.org website. Because OPEN and CoDA 
rely upon previously validated ZFPs, there are some sequences that cannot be 
targeted using these methods. Based on the failure rates and time investment for 
each approach, investigators should first consider CoDA, then OPEN, and then 
modular assembly only if the target sequence is unavailable through CoDA or 
OPEN. Supporting this recommendation, recent computational studies have 
suggested that binding of the ZFP to the DNA sequence is better thought of as 
synergistic rather than strictly modular [49, 50]. 


3.2.3 Purchase via Commercial Avenues 


Engineered ZFPs are also available commercially. Sangamo Therapeutics, Inc. 
developed a proprietary archive of engineered ZFs early on but has not made 
this information public, although they have published some of the details regard- 
ing their ZFN engineering platform [48]. Currently, the simplest means of 
obtaining a ZFN pair to a novel target sequence is by purchasing a custom pro- 
tein. Sangamo licensed its proprietary methodology to Sigma-Aldrich, which 
has marketed the technology as the CompoZr Zinc Finger Nuclease platform. 
Pre-validated ZFNs to the rat and mouse Rosa26 locus as well as the human 
AAVS1 safe harbor site present the most cost-effective option. These would 
allow the investigator to place transgenes at known genomic locations that will 
not interfere with genomic function, are commonly used, and are known “safe 
harbor” sites. Additionally, ZFNs to target an abundance of specific human, 
mouse, and rat genes are available at a more reasonable cost as compared with 
custom target options; the complete list of the genes is available online at www. 
sigmaaldrich.com. Custom ZFNs designed to target novel sequences require 
increased time and are produced at a much greater cost. A major advantage of 
using a commercial service to design a custom ZFN is the timeframe of delivery 
in less than 3 months. For most research investigators, use of the clustered regu- 
larly interspaced short palindromic repeat (CRISPR)/Cas9 system of targeted 
integration is now the fastest and most cost-effective method by which to initi- 
ate a nuclease-driven project [10]. Later on, purchased or designed ZFNs may be 
integrated into the molecular toolkit for intellectual property, reproducibility, or 
other experimental reasons. 


3.3 Genome Modification with Zinc Finger Nucleases


Engineered ZFNs can be used for a variety of genome alterations. These can be 
categorized as dependent on either HR or NHEJ. HR-based alterations include 
targeted addition of DNA sequence to the genome [16-18] through introduction 
of a new sequence flanked by homologous arms or targeted base-pair changes 
achieved by supplying a homologous template with the desired alteration. NHEJ- 
based changes do not require the introduction of homologous sequences and 
include gene disruption strategies that take advantage of the infidelity of NHEJ 
repair mechanisms. Another NHEJ strategy involves supplying ZFN pairs for
two sites to introduce two DSBs that may result in a large deletion or chromosomal
translocation [51].

HR-based methods can be used to introduce transgene sequences or small 
mutations at target sites by providing a homologous donor DNA template. 
This template should include 700 bp homology arms if it is typical double-stranded
circular DNA [27]. For linear DNA, only 50 bp of homology is required
[19]. Single-stranded DNA oligonucleotides have also been used to achieve 
point mutagenesis, deletions, or insertions [52, 53]. As compared with HR 
methods that involve simply introducing the homologous sequence with the 
desired mutation, introducing a targeted double-stranded DNA break enhances 
the efficiency of genome editing by many orders of magnitude [18, 54-57]. 
HR-based methods do not work in every cell type, however, as they require the 
presence of the HR machinery, which is only available during the S and G2 
phases of the cell cycle just prior to mitosis. Strategies employing HR can be 
achieved at desired rates in early stem cells. However, HR-directed genomic 
modification cannot be achieved at appreciable rates in many differentiated 
cell types because these cells are not dividing and do not have the HR machin- 
ery available. This presents a major roadblock for the design of a gene therapy- 
type strategy based on ZFN-induced HR. There are also species-specific 
differences in the frequency of HR to consider: for example, mouse embryonic 
stem (ES) cells are more prone to HR and thus easier to modify than human ES 
cells [58, 59]. 
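
The donor design described above (an insert flanked by homology arms matching the sequence on either side of the DSB) can be sketched as follows. The arm lengths are the figures quoted in this section; the function and dictionary names are hypothetical, and the sketch assumes the cut position is given as a 0-based index between two bases of a plain sequence string.

```python
# Homology arm lengths quoted in the text, keyed by assumed donor-type labels.
HOMOLOGY_ARM_BP = {"circular_dsDNA": 700, "linear_dsDNA": 50}


def build_donor(genome, cut_index, insert, arm_length):
    """Flank `insert` with homology arms taken from either side of the cut.

    cut_index is the 0-based position of the first base to the right of the DSB.
    """
    if cut_index - arm_length < 0 or cut_index + arm_length > len(genome):
        raise ValueError("homology arms would run past the end of the sequence")
    left_arm = genome[cut_index - arm_length:cut_index]
    right_arm = genome[cut_index:cut_index + arm_length]
    return left_arm + insert + right_arm
```

Passing an empty insert together with an edited copy of one arm would model the point-mutation case, although that variation is not shown here.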

HR is an elegant and seamless method to create perfectly tailored DNA 
sequences in the genome, but many somatic cells rely on the NHEJ repair pathway
instead. NHEJ-based gene disruption is much easier to achieve than HR,
although the resulting mutations are not predictable. To date, the only clinical
trial using a genome modification system based on nucleases uses a ZFN
pair to disrupt the CCR5 locus [13, 14]. Since the CCR5 cell surface receptor is
required for most HIV infection, disruption of this locus was used to create CD4
T cells that cannot be infected by HIV [13, 14]. The phase I clinical trial
indicated that patients infused with T cells that were modified via ZFN technol- 
ogy to lack functional CCR5 receptors exhibited a slower rate of decline in the 
modified CD4 T cells relative to unmodified T cells [13, 14]. Among the 12 clini- 
cal trial participants, one serious adverse event was reported of a patient suffer- 
ing fever, chills, and joint pain the day following infusion [14]. Nevertheless, the 
authors concluded that the autologous CD4 T-cell infusions were safe [14]. They 
also found that the blood level of HIV DNA decreased in most patients and one 
out of four patients tested had no detectable traces of HIV RNA in the tested 
samples, suggesting efficacy [14]. 

In addition to gene disruption via NHEJ, pairs of ZFNs may be designed to
induce chromosomal translocations [60] by causing two simultaneous DSBs at 
desired locations. This technique could be used to study translocations that are 
important for cancer formation. The same two-DSB strategy will also produce 
deletions of up to 15 Mb [51]. These deletions could be used to remove an exon, 
an entire genomic locus, or even a number of genes from the genome. Therefore, 
multiple options exist both for achieving the desired genome modification and 
the methods by which to achieve those modifications using ZFNs. 
3.4 Validating Zinc Finger Nuclease-Induced Genome
Alteration and Specificity 


Methods have been developed for monitoring endogenous gene modification.
One such assay evaluates DNA DSB repair using the Surveyor nuclease
[35]. This assay involves three steps: (i) polymerase chain reaction (PCR)
amplification of the region of interest and annealing of the strands to form 
homoduplexes (no mismatches) and heteroduplexes (containing mismatches), 
(ii) cleavage of the mismatched heteroduplexes by the Surveyor nuclease, and 
(iii) fragment size evaluation to determine if mismatched DNA was present [61]. 
By only cleaving annealed complexes containing both the mutated and wild-type 
DNA after amplification, the Surveyor nuclease can be used to estimate the level of
mutagenesis mediated by the ZFNs (Figure 3.2). 
Figure 3.2 A nuclease assay for detecting gene targeting. ZFNs are used to create a targeted
DNA DSB. PCR is used to amplify the targeted sites and the DNA is heated and cooled to
reanneal the strands, creating mismatched heteroduplexes of DNA. The Surveyor nuclease
cleaves the heteroduplexes only at the sites of mismatched DNA, leaving the homoduplexes
unmodified. Gel electrophoresis can then be used to observe and quantitate the efficiency of
gene targeting using ZFNs at the genomic level.
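
The quantitation step of the Surveyor assay can be sketched numerically. The chapter does not give a formula; the estimate below is a commonly used approximation and rests on two stated assumptions: strands reanneal at random, and every duplex containing at least one mutant strand is cleaved. Under those assumptions a modified-allele fraction p gives a cleaved fraction f = 1 - (1 - p)^2, so p = 1 - sqrt(1 - f). The function names and the band-intensity inputs are hypothetical.

```python
import math


def fraction_cleaved(uncut_intensity, cut_intensities):
    """Fraction of total band intensity found in the cleavage products."""
    total = uncut_intensity + sum(cut_intensities)
    if total <= 0:
        raise ValueError("total band intensity must be positive")
    return sum(cut_intensities) / total


def estimate_modified_fraction(f_cut):
    """Estimate the modified-allele fraction from the cleaved fraction.

    Uses p = 1 - sqrt(1 - f), valid only under the random-reannealing
    assumption described in the accompanying text.
    """
    if not 0.0 <= f_cut <= 1.0:
        raise ValueError("cleaved fraction must lie between 0 and 1")
    return 1.0 - math.sqrt(1.0 - f_cut)
```

Because the relationship is nonlinear, the cleaved fraction overstates the modified fraction; for example, 19% cleaved product corresponds to roughly 10% modified alleles under these assumptions.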


The Surveyor nuclease can be used to evaluate if the desired modification was 
achieved, but it cannot be used to evaluate the number of off-target events in an 
unbiased manner. The specificity of genome modification is also highly impor- 
tant, whether one is selecting a gene knockout by limiting dilution or determin- 
ing the percentage of off-target modifications in a pool to make inferences about 
gene therapy safety. To probe the specificity of the ZFP for the target DNA- 
binding site, in vitro binding profiles are experimentally derived [62]. Then, 
in silico prediction can be used to determine a number of predicted off-target 
sites to be PCR amplified. The Surveyor assay may be used to evaluate off-target 
cleavage at the predicted loci. Because this technique is limited by the successful 
prediction of the off-target binding sites, large-scale sequencing methods may 
be preferable as they provide a more comprehensive view of all off-target events 
in the genomic DNA [63, 64]. Whole-exome sequencing and next-generation 
sequencing methods have become widely available and more commonplace in 
recent years [65]. These sequencing methods permit unbiased whole-genome 
analysis of ZFN specificity. However, small numbers of off-target events may
not be effectively found by this or any method, so practical assessment of the
transformation and clonal expansion of treated cells by established methods,
such as a soft agar assay, is also advisable for development
of ZFN-based clinical products [66].
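
The in silico prediction step mentioned above can be approximated by a mismatch scan. This sketch substitutes a simple Hamming-distance threshold for the experimentally derived binding profiles the text describes, so it is a stand-in rather than the published method; the function names and the mismatch cutoff are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_off_targets(genome, target, max_mismatches):
    """Return (position, strand, mismatches) for every near match.

    The intended on-target site is reported too (with 0 mismatches).
    """
    genome = genome.upper()
    target = target.upper()
    n = len(target)
    queries = (("+", target), ("-", reverse_complement(target)))
    hits = []
    for pos in range(len(genome) - n + 1):
        window = genome[pos:pos + n]
        for strand, query in queries:
            distance = hamming(window, query)
            if distance <= max_mismatches:
                hits.append((pos, strand, distance))
    return hits
```

Sites returned by such a scan would be the loci to amplify and test with the Surveyor assay; as the text notes, anything the prediction misses goes undetected, which is the argument for sequencing-based approaches.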


3.5 Methods for Delivering Engineered Zinc Finger 
Nucleases into Cells 


The ability to perform targeted genome modification using ZFNs is dependent 
on the delivery of the ZFNs into target cells and into the nucleus. The ZFN genes 
may be introduced into the cell by viral or nonviral methods. Nonviral meth- 
ods to transfect cells ex vivo include lipophilic reagents and electroporation. 
Electroporation of plasmid DNA or RNA has proven effective, though elec- 
troporation can be toxic to cells and is less efficient than other methods such as 
viral delivery [67]. Viral delivery has been successful, including adenovirus [59], 
integrase-defective lentivirus [68], and adeno-associated virus [69-71]. Viral 
delivery can be used for both the delivery of the ZFN and homologous DNA if 
HR-directed modifications are desired. More recently, ZFN protein has been 
shown to be capable of traversing cell membranes to achieve genome editing 
[72]. Ultimately, the delivery methodology used for ZFN-mediated genome modification
will depend on the cell type targeted and whether the cells will be
modified in vitro or in vivo.


3.6 Zinc Finger Fusions to Transposases and Recombinases 


ZFNs are the better-characterized class of proteins for site-directed genome
modification. However, all nucleases, including ZFNs, have serious limitations. 
Difficulties in measuring the rates of off-target DNA cleavage, the dependence 
on cellular DNA repair machinery, high levels of DSB-induced toxicity leading to 
cell death, and the requirement for cell division are just some of the problems
caused by inducing free DSBs in the cell. DSBs are associated with carcinogenic 
agents and can cause undesired chromosomal translocations [73], although there 
have not been any reports to date of ZFNs causing cancer. Enzymes such as 
recombinases and transposases are capable of DNA excision and integration 
autonomously, without free DSBs and their negative aspects. However, trans- 
posases require very short sequences for integration, usually 2-8 bp, making 
their integration essentially random [74]. Fusion of ZF DNA-binding domains to 
recombinases [75-77] and transposases [78, 79] has resulted in successful redirection
of the integration events to varying degrees. Recombinases have built-in
DNA specificity and thus require reengineering to target user-defined chromo- 
somal targets [75-77]. Transposase fusions do not require such engineering 
since their target sites are so short. ZF-transposase fusions are sometimes highly
active [78, 80, 81]. However, these systems require further refinement. Firstly, 
transposase fusions have not yet demonstrated a high level of specificity in 
genomic targeting because the transposase portion of the ZFP transposes in a 
manner that is independent of the ZF DNA-binding domain. In order to increase 
specificity, one idea is to mutate the ZF-transposase fusions such that the transposase
domain is kept inactive until the ZF portion binds the DNA. Secondly,
transposase ZFPs require the presence of their short transposase target site in 
close proximity to the site recognized by the ZF [79], placing a limit on the avail- 
able target sites. Finally, despite many advances, effective engineering of a ZFP to 
target a unique genomic locus has not yet been accomplished. Attempts to target 
the checkpoint kinase-2 (CHK2), the ROSA26 locus, and the L-gulono-y-lactone 
oxidase pseudogene (GULOP) were unable to produce successful targeting in 
cells [79, 82]. Further development of proteins other than ZFNs for genomic 
targeting should lead to diverse technologies capable of site-specific gene addi- 
tion, even in cells not actively dividing. 


3.7 Conclusions


ZFNs are a proven tool for targeting endogenous loci in the genome, while ZFPs 
have the potential for user-defined modification of chromosomal targets without 
DSBs. Over the years, “open” access to ZFP engineering tools together with commercial
availability led to more widespread use. However, while ZFNs and other
nucleases began the field of targeted genome modification, one might expect
the focus on ZFNs to decrease in the coming years as the ease and simplicity of
working with the CRISPR/Cas9 system displaces the older, more expensive, and
time-consuming ZFN platform. The Cas9 system owes the rapid
pace of its development to the established systems and assays that were developed
for engineering and testing ZFNs. As therapies usually take over a decade
to reach the clinic and the ZFNs have a different set of patents governing
their use, it is still quite possible that ZFN-based drugs could become approved
for therapeutic use at some point in the near future. ZFPs will continue to be
important tools for genome engineering, both for asking critical biological questions
and for developing novel therapeutics to improve human health. Time, and
much research, will tell which genome engineering platform will be most fruitful
for desired applications or therapeutic goals.
Engineered biological parts, devices, and systems come to life when grafted into
a living cell. Host cells are organisms shaped by billions of years of evolution and 
characterized by high complexity, robustness, and the ability to adapt and evolve 
in response to fluctuations in their natural environment. For synthetic biological 
applications, where precise engineering of biological systems with predictable 
outputs is attempted, host cells displaying reduced complexity, higher genetic 
stability, and increased efficiency are desired. We show here that streamlining, 
the elimination of genomic regions unnecessary or counterproductive in bio- 
technological applications, is a promising way to produce host cells, which can 
outperform their natural ancestors in the less fluctuating environment of labora- 
tory settings. We focus here on the streamlining of E. coli, a primary host cell in 
research and industry. The rationale behind the streamlining process, identifica- 
tion of genomic parts targeted for elimination, deletion techniques, and results 
and applications of genome reduction projects will be presented. Current chal- 
lenges, obstacles, and possible future directions of genome streamlining will also 
be discussed. 


4.1 Introduction


Synthetic biological constructs — genetic circuits, modules, and devices — usually 
work in the context of a living cell. The information coded in the artificial blue- 
print, and embedded in the host genome, must be maintained and expressed by 
the cellular machinery of information processing. Ideally, the new construct 
functions in a predictable way and uses the cellular resources without much 
interference with the basic physiology of the host. 

Natural host cells, even relatively simple bacteria, however, provide an 
extremely complex and frequently unpredictable environment for the synthetic 
construct, causing interference with the desired function [1, 2]. Moreover, since 
living cells possess the intrinsic ability for physiological and genetic adaptation,
unwanted genotypic and phenotypic alterations may arise when challenged by 
artificial genetic constructs [3-5]. 

Conveniently, recent advances in genome manipulation and synthetic DNA 
construction techniques [6-11] as well as our rapidly expanding knowledge of 
the wealth of genome sequences [12] make genome-scale engineering possible, 
and, consequently, elimination of the disadvantageous features of the host cell 
can be attempted. Rationally redesigned, streamlined, and semisynthetic cus- 
tom-made genomes could then replace naturally evolved gene sets, leading to an 
effective domestication of the microbial world [3, 11, 13-16]. 

In this chapter we will discuss the concept of the streamlined bacterial chassis, 
argue that E. coli is a primary choice for a versatile host, and review the tools and 
approaches of genome reduction. Next, results of E. coli genome streamlining 
and selected applications of the reduced-genome strains will be presented. 
Finally, future directions, gaps in our knowledge to be filled in, and perspectives 
of genome streamlining will be briefly discussed. 


4.2 The Concept of a Streamlined Chassis 


Natural cells are complex biological systems reflecting a long evolutionary 
history. The intrinsic functional robustness of natural cells, due to intertwining
networks, functional redundancies, and feedback regulatory mechanisms, makes
them resilient to synthetic reprogramming [17]. Moreover, their genomes are
riddled with remnants of past adaptation events that may be irrelevant at present 
[18]. In addition, well-defined laboratory or industrial settings can be rather dif- 
ferent from complex and changing natural environments [19, 20], rendering the 
existing genomic capabilities partially dispensable. Even if a number of empiri- 
cally selected or purposefully introduced modifications shaped the genomes of 
some widely used experimental or industrial organisms, they still are unneces- 
sarily complex and heterogeneous biological systems with a vast number of com- 
ponents and network interactions, largely unsuitable for precise and rational 
engineering. 

Developing simple cells that provide only the very basic cellular machinery for 
maintaining and expressing designed constructs in a predefined range of condi- 
tions would thus greatly facilitate predictable engineering. Such a biological 
“chassis” could be used as a starting point to add new modules and build more 
complex systems adjusted to special needs [21]. Moreover, creating a more ame- 
nable and embraceable system would facilitate our understanding of general bio- 
logical phenomena, such as transcriptome complexity, energy metabolism, and 
robustness [22, 23]. (It should be noted that a somewhat different interpretation 
of the chassis restricts it to a DNA-less cellular container, into which synthetic 
genomes could be transplanted [21]. We use here the term for a self-sustaining 
cellular system, complete with a simple genome.) 

What are the desired features of a cellular chassis? First, it should have signifi- 
cantly reduced complexity. By eliminating unnecessary components, predict- 
ability of reprogramming could be enhanced. Second, the chassis should be 
genetically stable. Even if mutagenesis and evolvability cannot be totally 
repressed, genetic change should be kept at the minimum to preserve the 
designed functionality. Third, by eliminating dispensable, energy-consuming 
components, the chassis should function more economically and utilize the 
resources efficiently, allowing high-yield product formation under well-defined 
conditions. In addition, the biological chassis should be safe for health and for 
the environment. By embedding genetic barriers in the blueprint, accidental 
release and genetic mixing with the natural organisms can be prevented. 

Construction of a simple cell can be attempted in two ways. On one hand, 
building genomes from scratch, using synthetic oligonucleotide assemblies, is an 
approach of great potential [3]. Despite the theoretical challenges of bottom-up 
genome design, the toolbox of genome assembly and transplantation into a living 
cell is undergoing continuous development [24-26]. The grandiose project of 
synthesizing a minimal genome (a genome comprising only essential genes), 
as seen for Mycoplasma mycoides, may therefore become general practice one day 
[15]. On the other hand, rational simplification and optimization of existing 
robust cells in routine laboratory use is a less challenging and less risky endeavor. 
Beyond the elimination of unnecessary genes (streamlining), creating a chassis 
might involve other modifications as well: altering the genetic code (codon 
swaps), introduction of non-interfering subsystems (orthogonality), and rede- 
sign and rewiring (optimization) [6, 21]. Here we will discuss genome streamlin- 
ing by focusing on the reduction of the E. coli genome. 

By genome streamlining, we do not mean creating an absolute minimal 
set of genes required for life. Rather, the aim here is to produce a significantly 
reduced genome that retains all the important genes required for robust growth 
and easy genetic manipulation in a practical, laboratory, or industrial setting. 


4.3 The E. coli Genome 


E. coli is an important commensal and pathogen, an excellent model for research, 
and one of the most widely used industrial organisms. Among thousands of iso- 
lates, five strains (K-12, B, C, Crooks, and W) and their derivatives have been 
used extensively in laboratories for over 70 years [27]. Biotechnological applica- 
tions range from production of commodity chemicals and biofuels to vaccine 
development and bioremediation. Notably, nearly 30% of approved recombinant 
therapeutic proteins are currently produced in E. coli. 

The popularity of E. coli is owed to its versatility, simple culturability, and ease of 
genetic manipulation. E. coli can utilize a wide range of carbon and energy 
sources, is capable of aerobic growth and anaerobic fermentation, and can survive 
not only in the intestinal tract but also in the outside environment. The 
versatility of the bacterium is reflected in its relatively large (4.5-5.5 Mb) [28], 
high gene-density genome. 

The genome sequence of the prototype laboratory strain K-12 MG1655 
became available in 1997 [18] (selected features shown in Figure 4.1). The 
4.6 Mb chromosome contains ~4300 protein-coding genes, accounting for 
about 88% of the genome. The remaining part encodes stable RNAs (0.8%) and 
provides regulatory and other functions (~11%). The genome is thus nearly fully 
loaded with information-bearing sequences, leaving very little room for apparently 
useless, intergenic DNA with no obvious function. The largest group of 
genes codes for transport and binding proteins, reflecting the wide variety of 
substrates the bacterium can utilize. Surprisingly, despite the long laboratory 
history of E. coli, 38% of the genes had no experimentally verified function at the 
time of sequencing, and even today this number stands at about 20% [31]. 

Figure 4.1 Schematic map of selected features of the E. coli K-12 MG1655 genome, numbered 
on the perimeter in base pairs. Outward from the center, rings depict (1) strain-specific K-12 
genomic islands longer than 4 kbp [29], (2) essential genes (www.shigen.nig.ac.jp/ecoli/pec/index.jsp), 
(3) ribosomal RNA operons, (4) IS elements, (5) prophages [18], and (6) 
macrodomains [30]. Ori and ter indicate the origin and terminus of replication, respectively. 

As more genome sequences of E. coli strains became available, a peculiar, 
mosaic-like genome structure was revealed. The genomes share a common, 
homologous colinear backbone sequence, interrupted by hundreds of strain- 
specific genomic islands. Typically, these genomic islands carry marks of rela- 
tively recent horizontal transfer events and are characterized by a higher than 
average number of unknown genes, mobile genetic elements, and a relatively 
high A+T content. Since the basic cellular functions seem to be coded on the 
backbone sequences, whereas less important, lifestyle-specific genes reside on 
the genomic islands, they are called “core genome” and “auxiliary genome,” 
respectively. 

How many genes belong to the core genome? Obviously, the more genomes are 
compared, the smaller the core genome appears, and the core identified within a 
phylogroup is larger than the core obtained by inclusion of distant relatives. A 
comparison of 61 sequenced E. coli genomes revealed that out of a huge pan- 
genome of 15741 gene families, only 993 (6%) of the families were represented in 
every genome (core genome) [32]. The accessory genes thus make up more than 
90% of the pan-genome and about 80% of a typical genome [32]. It should be 
noted, however, that selection criteria applied to find conserved genes might 
miss homologs in distantly related strains. Moreover, alternative genetic solu- 
tions might exist for the same function. A refined comparison of 186 sequenced 
E. coli genomes [33], identifying homolog gene clusters (HGCs), revealed a pan- 
genome of 16373 HGCs. The “soft core,” defined as all HGCs found in at least 
95% of the genomes, consisted of 3051 HGCs (Figure 4.2). A recent census, list- 
ing 2085 sequenced E. coli genomes, revealed that the pan-genome still grew 
linearly with the number of genomes added, while the size of the core genome of 
3188 gene families hardly changed [34]. 
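
The definitions used in these comparisons can be stated compactly in code. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene-family (or HGC) identifiers; the function name, the strain names, and the family names are hypothetical placeholders, and only the 95% soft-core threshold and the 993 and 15741 figures come from the text.

```python
from collections import Counter


def pan_core_summary(genomes, soft_threshold=0.95):
    """Classify gene families from a mapping of strain name -> set of families."""
    n_genomes = len(genomes)
    # Count in how many genomes each family occurs.
    counts = Counter(fam for fams in genomes.values() for fam in set(fams))
    return {
        # Pan-genome: every family seen in at least one genome.
        "pan": set(counts),
        # Strict core: families present in all genomes.
        "strict_core": {f for f, c in counts.items() if c == n_genomes},
        # Soft core: families present in at least soft_threshold of the genomes.
        "soft_core": {f for f, c in counts.items() if c >= soft_threshold * n_genomes},
    }


# Hypothetical toy input, for illustration only.
toy_genomes = {
    "strain1": {"famA", "famB", "famC"},
    "strain2": {"famA", "famB", "famD"},
    "strain3": {"famA", "famE"},
}
summary = pan_core_summary(toy_genomes)
print(sorted(summary["pan"]), sorted(summary["strict_core"]))

# Core fraction reported for the 61-genome comparison [32].
core_fraction_61 = 993 / 15741
print(f"core fraction: {core_fraction_61:.1%}")
```

With only three toy genomes, the soft core and the strict core coincide; the two definitions diverge only when enough genomes are compared for a family to be missing from a few of them.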

Why do we think that a significant part of the genome is dispensable without 
loss of fitness? E. coli evolved its gene set in the lower gut of animals, with 
periodic shedding into the environment. It obviously has many genes that are 
irrelevant under defined laboratory or industrial conditions. It was estimated 
that even under poor nutritional conditions, only 75-80% of the genes have 
detectable activity [35, 36]. Moreover, the genome is loaded with prophages and 
transposable elements (mostly residing on the accessory genome) (Figure 4.1), 
which, although they occasionally contribute to fitness under certain conditions, 
could be viewed as dispensable genomic parasites. Finally, the fact that a large 
proportion of the genes lack a known function, despite decades of E. coli 
research, suggests that they may be unimportant. 

Figure 4.2 Comparison of the pan-genome and core-genome sizes, defined by homologous 
gene clusters (HGCs). Data and classification criteria are from [33] and are based on the 
analysis of 186 sequenced E. coli genomes. HGCs are generated by sequence similarity (95% of 
HGCs have <0.242 substitutions per site). The soft-core genome is defined as all HGCs that 
have members in at least 95% of the 186 genomes. The strict core genome is defined as all 
HGCs that have members in all genomes. The pan-genome is defined as all HGCs. 


4.4 Random versus Targeted Streamlining 


There are natural organisms possessing a nearly minimal number of genes, often 
in the range of 400-600. These organisms are typically obligate host-associated 
bacteria, and phylogenetic studies indicate that the small gene sets evolved from 
much larger genomes through massive loss of genes no longer required in the 
intracellular environment. This suggests that a nutrient-rich, constant 
environment and a low population size favor genome reduction. It was estimated that the 
free-living ancestor of Buchnera has lost 75% of its genome since it switched to 
an endosymbiotic lifestyle approximately 200 million years ago [37]. As an anal- 
ogy, culturing a population of cells by serial passage under conditions favoring 
loss of genetic material (limiting nutrients for DNA synthesis, periodic popula- 
tion bottlenecks, defects in mismatch repair) could lead to smaller genomes 
[38-40]. Such an undirected procedure would have several advantages. First, no 
a priori knowledge of the genome is required. Second, high-fitness, rapidly grow- 
ing cells are automatically selected. Third, this approach allows the exploration 
of different orders and combinations of deletion events. Unfortunately, since 
DNA synthesis requires little energy [22], there is no strong selection for a smaller 
genome per se. Experimental work along this line so far has not resulted in major 
genome reduction. The deletion rate of 0.05-2.5 bp per genome per division, 
obtained in an experimental evolution test with Salmonella enterica [41], is too 
low for practical application. Similarly, a long-term laboratory evolution 
experiment applying serial passage of E. coli cells in a single medium yielded only a few 
deletions totaling 38 kb in 20000 generations [42]. Clearly, an experimental 
approach based on selectable deletion formation is needed for satisfactory results 
on a realistic time scale. An interesting approach partially fulfilled this require- 
ment. Using an engineered, composite transposon, serial random deletions were 
created in E. coli [43]. Transposon-inserted cells were selected in each cycle 
by their antibiotic resistance. Subsequent induction of an “inner” transposon 
resulted in deletion (or inversion) of a neighboring genomic segment along with 
the loss of the resistance cassette, and a new cycle could be initiated. Unfortunately, 
there are some drawbacks: only one-fourth of the transposon-inserted cells 
undergo the proper rearrangement, replica plating is needed to find the proper 
clones, small deletions are favored, and the construct leaves a 64-bp exogenous 
sequence in the genome in each cycle. In conclusion, due to lack of an adequate 
deletion selection scheme, random deletion methods are currently not applied to 
genome streamlining. Instead, targeted genome reduction schemes are favored. 
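
A back-of-the-envelope calculation makes the time-scale problem concrete. The sketch below takes the rates quoted above and asks how many generations undirected deletion would need to remove a given amount of DNA. The 690 kb target (about 15% of a 4.6 Mb genome) is a hypothetical goal chosen only for illustration, and treating each rate as constant over time is a simplifying assumption.

```python
# Rates quoted in the text.
QUOTED_RATE_RANGE = (0.05, 2.5)   # bp deleted per genome per division [41]
LONG_TERM_DELETED_BP = 38_000     # total deletions in the long-term experiment [42]
LONG_TERM_GENERATIONS = 20_000

# Hypothetical reduction goal, not from the text: about 15% of a 4.6 Mb genome.
TARGET_BP = 690_000


def generations_needed(target_bp, rate_bp_per_generation):
    """Generations required to delete target_bp at a constant rate."""
    return target_bp / rate_bp_per_generation


# Average loss per generation in the long-term experiment.
long_term_rate = LONG_TERM_DELETED_BP / LONG_TERM_GENERATIONS

estimates = {
    "slowest quoted rate": generations_needed(TARGET_BP, QUOTED_RATE_RANGE[0]),
    "fastest quoted rate": generations_needed(TARGET_BP, QUOTED_RATE_RANGE[1]),
    "long-term experiment average": generations_needed(TARGET_BP, long_term_rate),
}

for label, generations in estimates.items():
    print(f"{label}: about {generations:,.0f} generations")
```

Under these assumptions, even the most favorable rate implies a number of generations more than ten times the 20000 covered by the long-term experiment.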


Rational, serial construction of targeted genomic deletions requires full 
knowledge of the genome sequence, high-quality gene annotations, sufficiently 
deep knowledge of cellular physiology, and adequate engineering tools. All these 
prerequisites are fulfilled for commonly used E. coli strains. Targeted, rational 
design has several advantages: there is no deletion size constraint per se, no 
subsequent identification of the modifications is required, and an optimal serial 
strategy can be devised (subdivisions of deletions can be made and subsequently 
merged). Significantly, the process can be controlled at every step: in case a dele- 
tion causes an undesired effect (e.g., loss of fitness), the actual step can be 
skipped. On the other hand, the targeted approach suffers from historical contin- 
gency: cells with only predesigned deletions, introduced in an order of limited 
variability, are being created and tested. 


4.5 Selecting Deletion Targets 


4.5.1 General Considerations 


It is not a trivial task to rationally select dispensable portions of the genome. 
The goal is to obtain a streamlined genome that still supports robust and rapid 
growth on a range of customary substrates. Usefulness of a gene, obviously, is 
context dependent, and our knowledge of the cellular and molecular network 
responses under dynamically changing environmental conditions is very limited. 
However, there are some gene categories that most likely make a negligible 
contribution to fitness under most conditions. There are several approaches 
that help identify these targets. 


4.5.1.1 Naturally Evolved Minimal Genomes 

The small genomes of obligate symbionts and parasites can provide a template 
for a basic set of genes needed for maintaining cellular life. However, simply tak- 
ing them as a blueprint for a simple organism can be misleading. Since essential 
nutrients and protection are usually provided by the host, the 400-600 genes 
they typically harbor are not sufficient to maintain life outside the host [13]. 


4.5.1.2 Gene Essentiality Studies 

In most free-living organisms investigated, essential genes make up 10-30% of 
the genome. For E. coli, there are several large-scale gene essentiality studies 
available. High-throughput random transposon mutagenesis [44] or systematic 
gene inactivations [45] were applied to determine the subset of genes that are 
indispensable. However, essentiality studies are not fail-proof. First, essentiality 
is a function of the environmental context. Second, both query methods might 
miss some hits. Transposon mutagenesis studies assume that a gene that does 
not suffer an insertion event is essential; thus some genes escaping insertion by 
chance will be misclassified as essential. Moreover, single or grouped gene 
inactivations might not reveal redundant, but essential functions, and, conversely, 
might identify seemingly essential genes that, in fact, can be deleted in 
combination with other genes. Nevertheless, the 295 genes listed as essential candidates 
(“genes that have not been shown to be nonessential”; 
http://ecoliwiki.net/colipedia/index.php/Essential_genes, 28 May 2013) (Figure 4.1) 
should obviously be retained in the streamlining process. 
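
The chance of a nonessential gene escaping insertion can be estimated with a simple model. This is a rough sketch, assuming that insertions are independent and uniformly distributed over the chromosome, which is an idealization since real transposons have site preferences. The 4.6 Mb genome size is from the text; the gene length and the library sizes are hypothetical values chosen for illustration.

```python
import math

GENOME_BP = 4_600_000  # E. coli K-12 MG1655 chromosome size, from the text


def p_escape(gene_bp, insertions, genome_bp=GENOME_BP):
    """Probability that none of the insertions lands in a gene of length gene_bp.

    Assumes each insertion hits a uniformly random position, independently.
    """
    return (1 - gene_bp / genome_bp) ** insertions


# Hypothetical example: a 1 kb nonessential gene and libraries of increasing size.
gene_bp = 1_000
for insertions in (1_000, 10_000, 100_000):
    p = p_escape(gene_bp, insertions)
    # For small gene_bp / genome_bp this is close to exp(-insertions * gene_bp / genome_bp).
    approx = math.exp(-insertions * gene_bp / GENOME_BP)
    print(f"{insertions:>7} insertions: p(escape) = {p:.4f} (approx. {approx:.4f})")
```

The escape probability falls off exponentially with library size but rises sharply for short genes, so short nonessential genes are the most likely to be misclassified as essential in a sparse library.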


4.5.1.3 Comparative Genomics 

Genome comparisons of related strains are highly informative and probably give 
the best clues as to what to delete. Natural selection, within the genus, supposedly 
conserved the basic set of genes, collectively called the core genome, 
which are needed for robust performance [33]. Interspersed, horizontally 
acquired genomic islands carrying niche-specific and parasitic genes are obvious 
choices for removal (Figure 4.1). Although non-orthologous gene displacement 
might obscure shared functions [46], genome comparisons of more distantly 
related species could also help find deletion targets. For example, Buchnera 
sp. is thought to be a naturally minimized version of E. coli, sharing a common 
ancestor before switching to a symbiotic lifestyle. The 0.64 Mbp genome of 
Buchnera could serve to identify genes that are shared with E. coli and are 
probably important for growth. Genes unique to E. coli could then be used as a smaller 
pool to identify deletion candidates by other methods [47]. 


4.5.1.4 In silico Models 

Genome-scale metabolic network reconstructions coupled with constraint- 
based modeling can contribute to rational strain design by predicting gene essen- 
tiality and phenotypic consequences of gene deletions in microbes. Although 
these large-scale computational models continue to be expanded and updated, 
their predictive power to quantitatively assess cellular phenotypes in streamlin- 
ing studies is still limited [48]. In particular, these models often fail to identify 
groups of metabolic genes that are individually dispensable, but jointly essential. 
The most widely used E. coli reconstruction, while covering 1366 metabolic 
genes, still contains only a subset of the full gene complement of the cell [49]. 
Integration of other cellular systems (e.g., the machineries for replication, tran- 
scription, translation, posttranslational modifications) and regulatory processes 
is needed to more accurately compute complex cellular phenotypes [50, 51]. In 
addition, there are still too many unknown gene functions to accurately build an 
in silico interaction network that covers all key cellular processes. 
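
To illustrate what “individually dispensable, but jointly essential” means, the following toy sketch screens single and pairwise deletions. It is deliberately not a constraint-based (flux balance) model: it is a Boolean simplification in which a cell is viable as long as every required reaction retains at least one catalyzing gene. All reaction and gene names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy network: each required reaction lists the genes able to catalyze it.
REQUIRED_REACTIONS = {
    "rxn1": {"geneA"},            # no backup: geneA is essential on its own
    "rxn2": {"geneB", "geneC"},   # isozymes: either gene suffices
    "rxn3": {"geneD", "geneE"},   # isozymes: either gene suffices
}
# geneF takes part in no required reaction.
ALL_GENES = {"geneA", "geneB", "geneC", "geneD", "geneE", "geneF"}


def viable(deleted):
    """A cell is viable if every required reaction keeps at least one gene."""
    return all(catalysts - deleted for catalysts in REQUIRED_REACTIONS.values())


# Single-deletion screen: which genes are essential by themselves?
single_essential = {g for g in ALL_GENES if not viable({g})}

# Pairwise screen over the remaining genes: which pairs are jointly essential?
dispensable = sorted(ALL_GENES - single_essential)
jointly_essential = [
    pair for pair in combinations(dispensable, 2) if not viable(set(pair))
]

print("essential alone:", sorted(single_essential))
print("jointly essential pairs:", jointly_essential)
```

A single-deletion screen alone would classify five of the six toy genes as dispensable, although only a subset of them can actually be removed together. The number of combinations to test grows rapidly with group size, which is one reason serial deletions have to be checked step by step.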


4.5.1.5 Architectural Studies 

Genome streamlining does not equal simply minimizing the gene set. The mini- 
mal set of genetic information necessary to sustain a functioning cell might con- 
tain positional information as well: not only trans-acting genes but also cis-acting 
chromosomal regions might be essential. In a comprehensive study [52], the 
entire chromosome was scanned for cis-acting regions. Essential genes were 
deleted from the chromosome in the presence of complementing plasmids car- 
rying the particular gene. Surprisingly, the replication origin was found to be the 
only essential cis-acting region. Other, reportedly cis-acting regions, like dif 
(participating in resolution of replicated sister chromosomes) or migS (responsible 
for the polar movement of oriC) proved to be nonessential, and their removal 
caused only minor growth defects. 



In conclusion, genome streamlining is in large part a trial-and-error process. 
The large number of genes with unknown functions and the complex interac- 
tions of the constituents of the cell make precise a priori assessments difficult, 
especially when synergistic effects of serial deletions are considered. Nevertheless, 
based on the general considerations and on individual assessments, some gene 
categories can be marked as primary targets for deletion. 


4.5.2 Primary Deletion Targets 


4.5.2.1 Prophages 

Strains of E. coli harbor multiple prophages or phage-related elements that may 
represent a significant fraction of the genome (typically 3-5%) (Figure 4.1). 
Prophages have a long history of coevolution with their host and seem to be well 
integrated in the host physiology. Typically, their genes code for integrases, 
lysozymes, and phage structural proteins, but they may carry metabolic and 
toxin-antitoxin functions as well. Compared with the entire genome, a higher 
than average number of prophage genes have no known function [18, 53]. 
Regarding their effect on the desired cell characteristics, prophages are Janus- 
faced. They can stimulate cell growth in certain conditions and can help the host 
to cope with a number of adverse conditions; however, under other conditions, 
their effect can be reduced growth, increased sensitivity [54], and instability [55]. 
Although most of the prophages are cryptic, normally unable to excise and 
develop infectious particles, some may excise and lyse the host upon stress [56]. 
Lytic phage development can be fatal for subsequent cultures of non-lysogenic 
strains that may be infected and destroyed [57]. Overall, removal of prophages 
and phage remnants does not seem to have adverse effects under customary 
growth conditions and may promote uniformity and stability of the culture. 


4.5.2.2 Insertion Sequences (ISs) 

Insertion sequences (ISs) are small mobile genetic elements carrying the mini- 
mal genetic information (inverted repeat ends and transposase gene) for their 
own genomic insertion [58]. Typically dozens of ISs of several different classes 
reside in the genomes of E. coli strains (Figure 4.1). ISs are important agents of 
genetic diversity and are responsible for a significant portion of the mutational 
load for the cell. While there are well-documented cases when ISs contribute 
to adaptation of the cell to specific conditions, they can generally be viewed as 
genomic parasites causing genetic instability, especially under stress [59]. 
From the practical perspective, removal of ISs significantly increases genetic 
stability without adverse effects. In fact, there are cases where the presence of ISs 
prevents stable cloning of toxic genes through mutagenesis and selection of altered 
clones [5]. 


4.5.2.3 Defense Systems 

Common restriction systems of E. coli (hsdMRS, mcrBC, and mrr) and clustered 
regularly interspaced short palindromic repeat (CRISPR) systems provide 
defense against invasive foreign genetic material [60, 61]. While these systems 
are important factors in the interplay of evolutionary forces shaping the genomes, 
they can be a nuisance in synthetic biology constructions. Deleting them elimi- 
nates barriers to genome engineering procedures that involve the introduction of 
genetic material into a heterologous host. 


4.5.2.4 Genes of Unknown and Exotic Functions 

A significant portion of the genome codes for genes with unknown function 
(~20%) [31]. Not excluding the possibility of discovering new and important 
functions, deletion of these genes might be attempted with high confidence. 
Similarly, metabolic and transport genes associated with substrates not com- 
monly used are primary targets for removal. It might be intuitively argued that 
metabolic genes not needed under a particular condition are not expressed; 
hence deletion of them provides little gain in the economical use of resources. 
However, in fact, it was shown that, under conditions of declining carbon source 
quality, cells switch into a scavenging mode and express a variety of transport 
and metabolic genes to prepare for any substrate availability [62]. Thus, even if 
actually not used, exotic transport and metabolic genes can pose a metabolic 
burden on the cell. 


4.5.2.5 Repeat Sequences 

The largest repeat sequences of E. coli, rearrangement hot spot (Rhs) elements, 
are about 8 kb in length on average and collectively constitute about 1% of the 
genome [18]. Although widespread in E. coli strains, their function is poorly 
understood [63]. Rhs elements carry dispensable genes responsible for polysaccharide 
synthesis and export, as well as genes with unknown functions. They might 
promote RecA-dependent rearrangements of the chromosome and are thus 
undesired for synthetic biology applications. 


4.5.2.6 Virulence Factors and Surface Structures 

The commonly used E. coli K-12 MG1655 strain is non-pathogenic due to the 
lack of a type-III secretion system and haemolysin expression, in addition to an 
impaired O-antigen synthesis [18]. Nevertheless, the strain harbors a number 
of virulence-associated factors, like flagella, fimbriae, siderophores and a 
cryptic haemolysin. There is a theoretical chance that safe strains acquire 
mutations or horizontally transferred additional virulence factors that 
transform them into a pathogen. It is a cause for concern that a double point 
mutation in the gene coding for the histone-like protein HUα can turn K-12 
into an invasive strain [64]. Deletion of the genes associated with virulence 
therefore makes the cells safer. Elimination of surface structures might bring 
about other gains as well. For instance, the flagellar apparatus, not needed in a 
fermentor, consumes an estimated 1-2% of the total cellular energy. In addi- 
tion, flagella break off and regrow constantly, and these proteins, shed in the 
environment, constitute a net loss for the cell [65]. Deletion of flagellar and 
chemotaxis gene clusters might thus result in energy savings. Elimination of 
other surface structures (fimbriae, curli, lipopolysaccharide outer core, colanic 
acid capsule) could further improve the cellular economy and also reduce the 
propensity of the cell for biofilm formation [66]. 



4.5.2.7 Genetic Diversity-Generating Factors 

SOS-induced translesion DNA polymerases (polII, polIV, and polV) are major 
sources of mutations in the cell [67, 68]. When DNA is damaged, these polymer- 
ases rescue cells by bypassing bulky replication blocks and, at the same time, 
introduce point mutations in the genome. Whether the repair function or the 
generation of genetic diversity is the primary function is still debated. It seems 
that in case of moderate stress, alternative repair pathways can cope with the 
damage, but translesion polymerases are nevertheless induced and generate 
mutations [69, 70]. Deletion of the genes of translesion DNA polymerases is thus 
desirable to keep evolvability of the cell at the minimum. Indeed, it was shown 
that elimination of the translesion polymerases reduces the mutation rate of 
unstressed cells and, more significantly, prevents the increase of the mutation 
rate under stress. Engineered constructs, which pose a burden on cell growth 
and are therefore prone to deterioration via mutation and selection, can be 
maintained at higher fidelity in such a stabilized host [4]. It should be noted, 
however, that in case of heavy stress and DNA damage, when more extensive 
DNA repair is needed, lack of the translesion polymerases may cause a reduction 
in fitness [70]. 


4.5.2.8 Redundant and Overlapping Functions 

There are several redundant or overlapping functions in E. coli, and deletion of 
some of them can be attempted presumably without compromising growth and 
robustness. Typical examples include DNases, RNases, and transport systems. 
For instance, quadruple and quintuple mutations of nucleases were applied, 
albeit at a fitness cost under certain conditions, in order to increase the stability 
of electroporated oligonucleotides, enhancing the efficiency of oligonucleo- 
tide-mediated allelic replacement procedures [71, 72]. 


4.6 Targeted Deletion Techniques 


4.6.1 General Considerations 


E. coli is usually viewed as one of the most readily amenable organisms for genetic 
engineering, with an arsenal of genetic engineering tools available. However, not 
all E. coli strains can be equally well manipulated by the usual tools. Differences 
in restriction and recombination systems, variable transformation efficiency and 
antibiotic sensitivity, resistance to transducing phage, and restricted applicability 
of the counterselecting sacB-sucrose system are a few examples of potential 
obstacles. From the engineering point of view, K-12 derivatives are the best- 
suited strains. To date, nearly all serial, large-scale E. coli genome streamlining 
projects have been performed in such cell lines. 

Construction of targeted, base pair precision deletions is usually based on 
homologous recombination of dsDNA. To create a deletion, an engineered DNA 
segment, carrying a selection marker and sequences matching the flanking 
genomic regions of a desired deletion, is transformed in the cell, where exchange 
with the genomic segment takes place, catalyzed by endogenous recombinases. 
In a second recombination event, the exogenous sequences can be excised to 
leave a markerless deletion. Repeating the steps, multiple deletions can be cumu- 
lated in the cell [13]. 

A semi-random genome reduction attempt, combining deletions derived 
from mapped transposon-inserted genomic libraries, applied site-specific 
recombinase systems (Flp/frt, Cre/lox) for the excision step [73]. The proce- 
dure, however, leaves a scar, a 34-bp recognition site in the genome, which may 
interfere with subsequent rounds of deletions. The problem can be circum- 
vented by the use of mutant recognition sites, but the scheme is complex, and 
still a scar is left behind. Most large-scale streamlining projects therefore 
used improved, general homology-based deletion methods, producing scarless 
deletions. 

Mutant target sites of site-specific recombinases, preimplanted in the genome, 
can also be used to facilitate the exchange of long DNA fragments between an 
episome and the chromosome. Using this strategy, a 126 kbp-long chromosomal 
segment was replaced with a 72 kbp synthetic DNA cassette carrying three non- 
contiguous genomic deletions. The subsequent elimination of the remaining 
loxP sites by homologous recombination and introduction of novel mutant loxP 
sites can in theory make this somewhat complicated process applicable for 
large-scale genome reduction [74]. 


4.6.2 Basic Methods and Strategies 


4.6.2.1 Circular DNA-Based Method 

Suicide plasmids, replicons multiplying only under permissive conditions, 
serve as delivery vehicles for deletion-forming DNA constructs [75, 76]. The 
plasmid carries fused homology arms (~0.5-1.0 kb long DNA segments 
matching the flanking sequences of the genomic region to be deleted) for the 
first recombination event, an antibiotic resistance gene as a selection marker, 
and a gene (usually sacB) allowing counterselection [77] in the second 
recombination step. Integration into the genome at one side of the planned deletion 
occurs via recombination between one of the homology arms and the 
corresponding chromosomal sequence, catalyzed by RecA. Such co-integrates are 
selected by their antibiotic resistance under nonpermissive conditions for 
plasmid replication. Next, cells that resolved the co-integrate in a spontaneous 
second recombination event are selected by applying counterselection 
procedures (e.g., permitting growth of only sacB-negative cells by using 
sucrose-containing medium [78, 79]). The outcome of the procedure can be either 
recovery of the wild type or formation of a scarless deletion. An advanced, more 
effective version of the method applies I-SceI cleavage (the enzyme cuts the 
co-integrate at an 18-bp site [80] found exclusively in the integrated sequence) as a 
universal counterselection tool, which, at the same time, stimulates recombination [81] 
(Figure 4.3a). 
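
The two possible outcomes of co-integrate resolution can be traced with plain string operations. This is a minimal sketch that treats the chromosome as a linear string and each recombination as an exact crossover between identical copies of a homology arm; all sequences and names are hypothetical placeholders, and the bracketed backbone label stands in for the resistance marker, the counterselection gene, and the conditional origin.

```python
# Toy sequences; all names and sequences are hypothetical placeholders.
LEFT, RIGHT = "ttgacaatta", "tataatggcg"   # chromosome outside the region of interest
ARM_A, ARM_B = "ATGGCTAGCA", "GGATCCTTAA"  # homology arms (boxes A and B)
TARGET = "ccccggggccccgggg"                # genomic segment to be deleted
BACKBONE = "[AbR-sacB-ori]"                # plasmid backbone, shown as a label

WILD_TYPE = LEFT + ARM_A + TARGET + ARM_B + RIGHT


def integrate_via_arm_a(chromosome):
    """First step: single crossover between plasmid arm A and chromosomal arm A."""
    end_of_a = chromosome.index(ARM_A) + len(ARM_A)
    # The rest of the circular plasmid (B, backbone, second copy of A) is inserted.
    return chromosome[:end_of_a] + ARM_B + BACKBONE + ARM_A + chromosome[end_of_a:]


def resolve(cointegrate, arm):
    """Second step: crossover between the two direct repeats of the given arm."""
    first = cointegrate.index(arm)
    second = cointegrate.index(arm, first + len(arm))
    # Everything from the end of the first copy to the end of the second is lost.
    return cointegrate[:first + len(arm)] + cointegrate[second + len(arm):]


cointegrate = integrate_via_arm_a(WILD_TYPE)
via_a = resolve(cointegrate, ARM_A)  # same arm as the integration: wild type restored
via_b = resolve(cointegrate, ARM_B)  # the other arm: scarless deletion

print("wild type restored:", via_a == WILD_TYPE)
print("deletion product:  ", via_b)
```

Both resolution products have lost the backbone, which is why counterselection alone cannot distinguish them and the resulting clones still have to be screened for the deletion.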
Figure 4.3 General scheme of standard deletion procedures. (a) Overview of the circular 
DNA-based method. Boxes A and B represent >500-bp DNA segments flanking the genomic 
region to be deleted. AbR stands for an antibiotic resistance marker gene; ori indicates a 
replication origin functioning only under permissive conditions. (b) Overview of the 
λ-Red-mediated, linear DNA-based deletion method. Two alternative routes for generating 
deletions are shown. A, B, and C represent arbitrarily chosen 40-60-bp DNA segments 
(homology boxes). Arrowheads represent I-SceI cleavage sites. AbR and csm stand for an 
antibiotic resistance marker and a counterselectable gene, respectively. 


4.6.2.2 Linear DNA-Based Method 

A more straightforward method applies a polymerase chain reaction (PCR)- 
generated linear dsDNA fragment carrying flanking homology arms and genes 
for selection and counterselection (Figure 4.3b). Recombination into the 
genome is catalyzed by the lambda Red system [82], expressed typically from a 
plasmid. The second recombination step, using a linear DNA fragment com- 
posed of the flanking homology arms, is applied to replace the exogenous 
sequences, creating a scarless deletion. Homology arms can be as short as 
40 bp [83, 84] but work best when longer (1 kb [47]). The method is straightforward,
and even large deletions (~100 kb) can be obtained efficiently.
Incorporating a third homology box and an I-SceI cleavage site in the targeting
dsDNA fragment alleviates the need for recombination of a second targeting
fragment, accelerating the procedure. Induced I-SceI cleavage of the integrated
sequence stimulates intramolecular recombination between the third homology
box and a matching neighboring genomic sequence, resulting in a scarless
deletion [85] (Figure 4.3b).
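
The two-step, linear DNA-based design can be illustrated with a minimal Python sketch. It extracts two 40-bp homology arms flanking a region to be deleted and joins them into the second-step fragment whose recombination would leave a scarless junction. The function names, the toy sequence, and the coordinates are hypothetical, the chromosome is treated as a linear string, and primer tails, marker cassettes, and specificity checks are deliberately omitted.

```python
# Illustrative sketch only: names, coordinates, and the toy sequence are hypothetical.
# The chromosome is treated as a linear string; wrap-around at the origin is ignored.

def homology_arms(genome, start, end, arm_len=40):
    """Return the (left, right) arms flanking genome[start:end] (0-based, end-exclusive)."""
    if not 0 <= start < end <= len(genome):
        raise ValueError("deletion coordinates out of range")
    if start < arm_len or len(genome) - end < arm_len:
        raise ValueError("not enough flanking sequence for the requested arm length")
    return genome[start - arm_len:start], genome[end:end + arm_len]


def resolving_fragment(genome, start, end, arm_len=40):
    """Join both arms: the second-step fragment that replaces the inserted markers."""
    left, right = homology_arms(genome, start, end, arm_len)
    return left + right


def scarless_deletion(genome, start, end):
    """Expected sequence once the selection/counterselection genes are recombined out."""
    return genome[:start] + genome[end:]


toy_genome = ("ACGTTGCAGGCTA" * 40)[:500]   # made-up 500-nt sequence
fragment = resolving_fragment(toy_genome, 200, 300)
edited = scarless_deletion(toy_genome, 200, 300)
# The joined arms should span the new junction in the edited sequence.
print(len(fragment), len(edited), fragment == edited[160:240])
```

With the toy input, the 80-nt fragment matches the sequence spanning the junction of the 400-nt edited string, which is the defining property of a scarless deletion.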


4.6.2.3 Strategy for Piling Deletions 

Accumulating deletions in a cell one by one is a labor-intensive endeavor. Some 
simple strategic considerations help accelerate the process. Individual
deletion intermediates (e.g., unresolved co-integrates carrying a selection 
marker) can be made and checked for fitness in a parallel fashion. Genomic 
segments carrying the selected deletion intermediates can then be sequentially 
transferred into the multiple deletion strain by cycles of P1 transduction and
I-SceI-stimulated scarless resolution. In principle, addition of new deletions to
the final host can be accomplished in a multiplex, iterative fashion. This might 
allow the combination of the best deletion candidates, selected due to faster 
growth. 


4.6.2.4 New Variations on Deletion Construction 

Several overlapping studies demonstrate an approach that couples the CRISPR/
Cas9 system with λ-Red-mediated recombineering [86-90]. In contrast to previous
strategies, it does not rely on chromosomal integration and subsequent
removal of selectable markers. Since E. coli lacks the nonhomologous end-joining
(NHEJ) repair system, double-stranded chromosomal breaks are highly
lethal unless rescued by providing a bridging template DNA segment. This
strategy requires targeted double-stranded DNA cleavage by Cas9 and λ-Red-mediated
genomic integration of a homologous template DNA carrying the
desired deletion. The donor DNA can be either single or double stranded and
might be introduced as a plasmid or in a linear form. Chromosomal cleavage
not only facilitates recombination but also provides strong counterselection
against the wild-type cells; therefore, the efficiency of this tool can be very high,
up to 100%.
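
Choosing where Cas9 cuts is the first practical step in such a scheme. The sketch below enumerates candidate target sites on both strands of a sequence, assuming the commonly used Streptococcus pyogenes Cas9 with a 20-nt protospacer followed by an NGG PAM. The text above does not specify a Cas9 variant, so this is an assumption. The function names and the toy sequence are hypothetical, and off-target screening is not attempted.

```python
import re

# Assumption: SpCas9-style targeting (20-nt protospacer immediately 5' of an NGG PAM).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_ngg_sites(seq, spacer_len=20):
    """List (strand, position, protospacer, pam) for every candidate site.

    Positions are 0-based offsets on the scanned strand; for "-" they refer to
    the reverse complement of seq, not to seq itself.
    """
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % spacer_len)
    sites = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        # The lookahead lets overlapping candidates be reported as well.
        for m in pattern.finditer(s):
            sites.append((strand, m.start(), m.group(1), m.group(2)))
    return sites


toy_region = "ACGTACGTACGTACGTACGT" + "TGG"   # made-up 23-nt sequence
for site in find_ngg_sites(toy_region):
    print(site)
```

In the approach described above, a site found inside the region to be deleted is absent from the edited chromosome, so cleavage there acts only on cells that still carry the wild-type sequence.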

CRISPR/Cas9-derived nickases were also used to generate targeted deletions
between genomic repeats [91]. The authors showed that creating single-stranded
chromosomal incisions with mutant Cas9 nucleases is not lethal; moreover, it
facilitates intramolecular recombination between repetitive elements.
Dual-targeted nicking in IS element repeats generated two deletions in one step,
removing a total of 133 kbp from the genome.

CRISPR/Cas9 coupled with the NHEJ system from mycobacteria enables
rapid and continuous creation of large deletions without applying selection
markers or a homologous DNA template [92, 93]. First, CRISPR/Cas9-targeted
double-stranded breaks are generated flanking the desired deletion. Next, the
NHEJ proteins seal the DNA ends in an imprecise way and thus rescue the cells.
Using this powerful technique, deletion of a 123 kbp genomic fragment was
demonstrated [93].

Another way of using CRISPR/Cas nucleases to facilitate λ-Red-mediated
genome editing is to provide long linear DNA fragments by cleaving bacterial
artificial chromosomes (BACs) in vivo. Both the BAC cleavage and the
genomic recombination processes are selected for using appropriately placed 
positive/negative selection markers. This method, referred to as replicon exci- 
sion for enhanced genome engineering through programmed recombination 
(REXER), has been used to replace a 230 kbp-long genomic segment of E. coli 
and could be a promising technique for the stepwise re-coding of the complete 
chromosome [16]. 

A related strategy referred to as multiple essential genes assembling (MEGA) 
applies the I-SceI endonuclease to release a linear DNA fragment from a circular 
plasmid in vivo [94]. The released fragment comprises all essential genes corresponding
to the targeted genomic region. Subsequent cleavage of the I-SceI sites
inserted into the chromosome generates a double-strand break that facilitates
the replacement of the genomic region with the essential gene cluster by λ-Red
recombination.

Recently, rapid streamlining and genome-wide inactivation of IS elements
were accomplished by genome shuffling between different E. coli strains, fol- 
lowed by multiplex genome modifications by CRISPR/Cas-assisted MAGE [95]. 
First, prophages were deleted by shuffling prophage-free segments of multiple 
deletion series (MDS) genomes into E. coli BL21 by P1 transduction. This was 
followed by subsequent rounds of CRISPR/Cas-assisted MAGE on multiplex IS 
targets, disrupting the transposases of the IS elements. With the growing num- 
ber of reduced-genome strains, such recycling of streamlined genomes might 
accelerate strain construction. 


4.7 Genome-Reducing Efforts and the Impact 
of Streamlining 


4.7.1 Comparative Genomics-Based Genome Stabilization 
and Improvement 


The first systematic, large-scale genome reduction project was aimed at removing
the largest K-12-specific genomic islands from the MG1655 genome [85]
(Figure 4.4). Identification of the K-islands was based on the sequence comparison
of three E. coli genomes available at that time (MG1655, enterohemorrhagic
O157:H7, and uropathogenic CFT073). Via a series of linear
DNA-mediated recombineering steps, including a novel way of I-SceI-stimulated
scarless resolution of the recombination intermediate, 12 precise
deletions were created and combined in a single strain. Compared with the
parental MG1655, the resulting MDS12 (multiple deletion series strain with 12 
deletions) had a genome reduced by 8.1%, with 9.3% of the genes deleted. All 
prophages and 24 of the 44 transposable elements present in the MG1655 
genome were deleted. Growth rates of MDS12 in minimal and rich medium 
were similar to those of MG1655. Doubling times were nearly identical, but 
MDS12 reached 10% higher density in stationary phase. Electroporation and 
transformation efficiencies of the parental and the MDS12 strain were identical.
This first attempt at drastic genome streamlining proved that by applying
a rational design strategy, a large fraction of the genes can be removed from an
organism that has been shaped by billions of years of evolution. Moreover, this 
could be done without losing robustness and rapid growth, at least under the 
laboratory conditions tested. 

The next milestones of this project were the IS-free MDS41, MDS42, and 
MDS43 strains with 14.28, 14.30, and 15.27% of the genome deleted, respec- 
tively [29]. Deletion targets were primarily selected by comparative genomics of 
several sequenced strains (RS218, CFT073, Shigella flexneri 2457T, O157:H7 
EDL933, and DH10B) and by assessment of literature data on the particular 
gene functions. Major K-islands were targeted, but deletions were in several 
cases extended to include neighboring nonessential genes with no impact on 
growth in either rich or minimal media. Deletions were tested for growth prop- 
erties both individually and when combined in a single strain. Growth rates of 
the MDS cells were similar to those of the parental MG1655. Elimination of




Figure 4.4 Deletion map of reduced-genome E. coli strains. Rings depict features mapped to
the genome of E. coli K-12 MG1655, numbered on the perimeter in kilobase pairs. Outward from
the center, (1) strain-specific K-12 genomic islands longer than 4 kbp [96], (2) essential genes
(www.shigen.nig.ac.jp/ecoli/pec/index.jsp), and (3)-(8) sets of deletions constructed by
Goryshin et al. [43], Yu et al. [73], Hashimoto et al. [97], Pósfai et al. (MDS42: black boxes, MDS69:
black and gray boxes) [29, 85], Mizoguchi et al. (MGF-01) [47], and Hirokawa et al. (DGF-298)
[98], respectively. Ori and ter indicate the origin and terminus of replication, respectively.


recombinogenic or mobile DNA stabilized the MDS genomes and provided a 
host free of IS contamination for plasmid preparations and gene libraries, 
reducing the chances for cloning artifacts, solving a frequently arising but usu- 
ally overlooked problem. Many cryptic virulence genes were also removed, pre- 
sumably increasing the safety of the strains. High yields of recombinant protein 
production were achieved in MDS cells. Genome reduction also led to unantici- 
pated beneficial properties: high electroporation efficiency and accurate propa- 
gation of recombinant genes and plasmids with strong secondary structure that 
were unstable in other strains. It was demonstrated that the stability of lentiviral 
vectors containing long direct repeats was significantly enhanced in MDS42 
[99]. Genome stability was further increased by deleting the three SOS-inducible,
error-prone DNA polymerases PolII, PolIV, and PolV [4], significantly
reducing point-mutation rates, thereby allowing more faithful expression
of a heterologous toxic protein. Versions of MDS42 that are less recombinogenic
(recA), are suitable for blue-white selection cloning (lacZΔM15), or
express an inducible T7 polymerase are also available for common applications.
Continuing the MDS series, reduced-genome strains with up to 69 deletions
were created by removing further putatively horizontally transferred regions
[100]. The final member of the series, MDS69, lost 965 genes, 20.3% of the
genome.

A novel, rapid streamlining workflow based on shuffling of MDS genome
segments and CRISPR/Cas-assisted MAGE was developed to improve the stability
of E. coli BL21(DE3), a host frequently used for high-level recombinant
protein production [101]. All 9 resident prophages were deleted and all 50 active
IS elements were removed or inactivated. The DE3 prophage carrying an induc- 
ible T7 RNA polymerase gene was exchanged with a tightly controlled T7 RNA 
polymerase cassette. Additional strain variants with inactivated error-prone 
DNA polymerases were also constructed. The streamlined BL21(DE3)-K-12 
hybrid strains retained the favorable characteristics of BL21(DE3), displayed 
increased genomic and plasmid stability, and allowed elevated electroporation 
efficiencies [95]. 


4.7.2 Genome Reduction Based on Gene Essentiality


In another attempt to reduce the genome of E. coli MG1655, a series of
medium-scale and large-scale markerless deletions was constructed using linear
targeting DNA, λ-Red-mediated recombination, and sacB-based counterselection
methods (Figure 4.4) [97]. Deletion targets were selected by excluding essential
genes and maximizing the potentially deletable chromosomal segments.
First, many nonessential regions were removed from the chromosome. Next, by
combining consecutive deletions, a series of mutants was constructed, lacking
up to 29.7% (16 combined deletions) of the chromosome. Mutants with individ- 
ual deletions grew like the wild-type strain. The mutants with an increasing 
number of combined deletions, however, grew increasingly slower than the 
parental strain in rich medium. The mutant with the largest number of deletions 
(16) grew much slower than the parental strain (45.4 min vs. 26.2 min doubling 
time) and showed aberrant nucleoid morphology, as well as altered cell shape and 
size. It was concluded that the additive effect of large deletions cannot always
be predicted, but the deletion of nonessential chromosome regions may be
valuable for elucidating cellular processes governed by multiple systems. 
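
To put the reported doubling times on a common scale, they can be converted to specific growth rates. The short sketch below uses the standard relation mu = ln(2) / t_d, which assumes balanced exponential growth. The function name is illustrative, and only the two doubling times are taken from the text.

```python
import math


def growth_rate_per_hour(doubling_time_min):
    """Specific growth rate mu = ln(2) / t_d, returned in h^-1."""
    return math.log(2) / (doubling_time_min / 60.0)


parental = growth_rate_per_hour(26.2)   # parental strain, rich medium
mutant = growth_rate_per_hour(45.4)     # strain with 16 combined deletions
relative = mutant / parental            # equals 26.2 / 45.4

print(f"parental mu = {parental:.2f} per h, mutant mu = {mutant:.2f} per h")
print(f"mutant grows at {relative:.0%} of the parental rate")
```

Under this assumption the 16-deletion mutant grows at roughly 58% of the parental rate, since the ratio of growth rates is simply the inverse ratio of the doubling times.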

The interspersed nature of essential genes within bacterial chromosomes 
would normally require genome reduction projects to execute numerous 
short deletions targeting the flanked nonessential segments. To accelerate 
this process, the MEGA (see above) technique replaces long chromosomal 
stretches with short DNA cassettes comprising solely the essential genes of 
the targeted segment [94]. As a proof of principle, three regions ranging from 
80 to 205 kbp were deleted this way in the E. coli chromosome with each tar- 
get containing two to eight essential genes. The authors envisioned the step- 
wise, complete replacement of the E. coli genome with the gene set essential 
to sustain life. 


4.7.3 Complex Streamlining Efforts Based on Growth Properties 


In a study focusing on cell growth in minimal medium, long, scarless deletions of
E. coli W3110 were constructed (Figure 4.4) [47]. The long-term goal of the work
is to create streamlined-genome strains that are suitable platforms for metabolic
engineering. To identify deletable chromosomal segments, the genome
sequences of E. coli K-12 MG1655 and Buchnera sp. APS were used for comparative
genomics, and genes unique to E. coli were selected. Essential genes
reported in the PEC database (http://www.shigen.nig.ac.jp/ecoli/pec/index.jsp)
were excluded from the deletion list. The annotations of the remaining genes were
surveyed in databases to judge their importance for efficient growth in M9 minimal
medium, and regions with more than 10 continuous deletable genes were
chosen for deletion. Genomic deletions were made by using λ-Red-mediated
recombination and the negative selection marker sacB. Individual deletion strains
were checked for growth in M9 minimal medium, and only the well-growing constructs
were chosen for further use. By combining the individual deletions, a top-performer
strain (designated minimum genome factory 01 (MGF-01)) with 1 Mb
total genome reduction was obtained. MGF-01 grew as fast as the parental W3110
strain and reached higher optical density and a higher number of colony-forming
units (CFUs) in stationary phase in minimal medium. This higher-density growth
property emerged from the superposition of the individual deletions and might be
caused by lower accumulation of growth-inhibiting acetate, presumably
due to the elevated expression of the glyoxylate shunt-related genes aceA and aceB.
This more efficient metabolism could also be the reason why MGF-01 with an
L-threonine-producing unit integrated into the genome produced a 2.4-fold higher
amount of L-threonine than the parental strain carrying the same unit.
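
The target-selection rule described above (regions with more than 10 continuous deletable genes) can be sketched as a simple scan along the gene order. In this hypothetical example, `genes` is an ordered list of gene names and `keep` is the set of genes that must be retained, standing in for the essential genes and the genes judged important for growth. Both inputs and all names are invented for illustration, and the chromosome is treated as linear.

```python
from itertools import groupby


def deletable_runs(genes, keep, min_genes=11):
    """Return (start_index, gene_block) for each run of consecutive deletable genes.

    min_genes=11 encodes "more than 10 continuous deletable genes".
    """
    runs = []
    index = 0
    for is_deletable, group in groupby(genes, key=lambda g: g not in keep):
        block = list(group)
        if is_deletable and len(block) >= min_genes:
            runs.append((index, block))
        index += len(block)
    return runs


# Toy gene order: 30 made-up genes, two of which must be kept.
genes = [f"g{i}" for i in range(30)]
keep = {"g5", "g17"}
for start, block in deletable_runs(genes, keep):
    print(start, len(block), block[0], block[-1])
```

Here the run g0 to g4 is too short to qualify, whereas g6 to g16 (11 genes) and g18 to g29 (12 genes) are reported as candidate deletion regions. In the study itself each candidate was then built and tested individually for growth in M9 minimal medium.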

The genome of MGF-01 was further reduced via step-by-step accumulation of 
additional deletions made in W3110 (Figure 4.4) [98]. Noncore regions were 
chosen for deletions. Starting with 37 individual deletions, strains with normal 
phenotypes were selected, and 10 of them were added to MGF-01 in subsequent 
cycles, generating MGF-02. Analysis of the growth phenotype of MGF-02 
revealed that deletion of gcvA, encoding a positive regulator of the glycine cleavage
system, enhanced initial growth in minimal medium. To further optimize the
strain, two intrinsic mutations of parental MG1655, ilvG and rph-1 (causing
valine sensitivity and partial pyrimidine starvation, respectively), were corrected both
in MGF-01 and MGF-02, creating DGF-362 and DGF-348, respectively. Starting
from DGF-348, further deletions were added by keeping only those without
growth-reducing synergistic effects. The proVXW-carrying region, deleted initially,
was reintroduced into the genome to correct the sensitivity to high osmolarity.
Eventually, the strain with the smallest genome (DGF-298) possessed a 2.98 Mb 
chromosome and was free from all IS elements. DGF-298 grew better in M9 
minimal medium than parental W3110 and also had higher cell yield in a simple 
medium (CSL) in fermentation. Transcriptome analysis showed that a heat- 
shock chaperone (IbpAB) and a protease for abnormal proteins (Lon) are down- 
regulated in DGF strains. The authors concluded that downregulation of the 
genes encoding chaperones and proteases is one of the factors that improve the 
fitness of DGF strains. 


4.7.4 Additional Genome Reduction Studies


In an early proof-of-concept work, random genomic deletions were constructed 
in MG1655 by developing a method involving repeated integration/deletion of a 
Tn5 transposon derivative. Deleted regions could be rescued on a conditionally 
replicating plasmid, allowing identification of essential genes. The extent of the 
genome reduction in the most deleted strain was about 5.6%, as estimated by 
pulsed-field electrophoresis [43] (Figure 4.4). 

Utilizing a pre-mapped transposon insertion library, a semi-random method 
was applied to reduce the genome of MG1655 by 6.7% (Figure 4.4) [73]. A pair of
selected transposon insertions could be combined in a single cell by P1 transduction,
and the genomic region between them could be excised by the Cre/lox system.
Deletions were likewise combined in a single genome by P1
transduction. In some multiple-deletion strains, synthetic lethality was observed:
some deletions were individually viable but were lethal when combined. This
genome engineering strategy, producing large sets of mapped transposon insertions
ready for pairwise combination, followed by Cre/lox-mediated deletion of
the intervening region, is most useful when deletion of a particular region of the
genome is desired.


4.8 Selected Research Applications of Streamlined- 
Genome E. coli 


4.8.1 Testing Genome Streamlining Hypotheses 


The MDS series with increasing number of genomic deletions provides a con- 
venient model for studying the impact of stepwise genome streamlining on cel- 
lular traits, addressing unsettled questions of reductive genome evolution [100]. 
A comprehensive study showed that deletions caused a gradual fitness loss, 
decreased nutrient utilization, and induced a general stress response. Growth 
yield and maintenance energy were measured in chemostat cultures of MG1655, 
MDS42, and MDS69 under nutrient limitation. Both carbon and nitrogen utili- 
zation efficiencies decreased in the multideletion strains without significantly 
affecting the maintenance energy requirement of the cell. These results argue 
against the adaptive genome streamlining hypothesis [102, 103]. Results sup- 
ported the notion that selection for reduced DNA synthesis per se is unlikely to 
reduce genome size in the course of evolution of small genomes. No general
trend was found between growth rate and genome size, nor between cell size
and genome size. Genome reduction was also shown to cause transcriptome
reprogramming. Many targets of the general stress sigma factor RpoS were
upregulated in MDS42 and MDS69. rprA, a small regulatory RNA that facilitates
RpoS translation, was strongly induced, and, as expected, MDS42 and MDS69
had elevated acid resistance. These studies revealed an unexpectedly significant
role of horizontally transferred genes not only in stressful environments but also 
under routine growth conditions. 




4.8.2 Mobile Genetic Elements, Mutations, and Evolution 


Bacterial genomes are usually loaded with a great number of ISs of many types. 
The evolutionary forces driving their accumulation and their general impact on 
adaptive evolution of the host are unknown. IS-free MDS42 provides a unique 
opportunity to investigate the initial spread and evolutionary impact of ISs. By 
introducing a single IS1 element into the genome of MDS42, its impact could be
analyzed in laboratory evolutionary experiments. Although the IS element 
increased the mutational supply and contributed to adaptation, another mutator 
gene (mutS), frequently found in natural isolates, had a much greater impact on 
the evolution of the cell. Moreover, mutS cells outcompeted IS-carrying cells, 
limiting their spread. This work showed that the initial spread of IS elements 
might depend on the presence of other mutator mechanisms in the population, 
hence demonstrating the evolutionary conflict between different mutation-gen- 
erating mechanisms [104]. 

Mobile element-free strains were also used in synthetic biology studies. To 
improve the stability of synthetic genetic circuits, bidirectional (overlapping 
forward and backward) promoters were designed to couple transcription of a 
target nonessential gene to the transcription of an essential gene. The evolutionary
half-life of the gene of interest increased 4-10 times, and the circuit
was more stable in the IS-free MDS42 than in MG1655. However, eventually 
point mutations, insertions/deletions and recombination occurred even in 
MDS42, demonstrating the need for further stabilization of synthetic con- 
structs [105]. 


4.8.3 Gene Function and Network Regulation


MDS42 proved to be especially useful in transcriptional studies elucidating the 
physiological role and the molecular mechanisms of the Rho-dependent transcription
termination system [54]. Rho silences foreign DNA, repressing
prophages and other horizontally acquired portions of the genome, but this
function becomes less important in MDS42, which lacks prophages and many horizontally
transferred regions. As a consequence, MDS42 shows 10* times lower
sensitivity to the Rho-inhibitor bicyclomycin than the ancestor MG1655. 
Moreover, Rho cofactors NusA and NusG, normally essential in E. coli, become 
dispensable in MDS42. 

Reduced-genome strains were used to identify the genes required for biofilm 
development [106]. This approach uncovered new genes, some of which are
cryptic in MG1655 but expressed in the reduced-genome mutant.
In addition, by means of the deletion strains, a new repressor was
identified for starvation-sensing protein RspA [107]. 

The relationship between the genomic and environmental contributions to the 
transcriptome was analyzed by comparing the transcriptomes of MG1655 and 
MDS42 grown in regular and transient heat-shock conditions. Results suggest a 
cross-talk guiding transcriptional reorganization in E. coli in response to both 
genetic and environmental disturbances [108]. 


4.8.4 Codon Reassignment


Incorporation of unnatural amino acids (uaas) in proteins in living cells would 
enable evolution of novel protein functions [109]. Relatively rarely used stop 
codons, coupled with orthogonal tRNA/synthetase pairs, can be exploited to 
genetically introduce uaas. A major limitation of using a stop codon to encode 
uaas is the low efficiency of incorporation due to competition of the suppressor 
tRNA with endogenous release factors (RF1 and RF2 in prokaryotes). UAG is the 
least used stop codon in E. coli (present in 7% of the genes) and is recognized by 
RF1, but not by RF2. To achieve full reassignment of UAG, the reportedly essen- 
tial RF1 must be removed from the system. It was shown that, after modifying 
the activity of RF2, the gene encoding RF1 (prfA) can be deleted from the E. coli 
genome. MDS42 was used as parental strain, because the deletion of nearly 700 
genes may alleviate the termination load imposed on RF2. Besides the demon- 
strated successes for multisite incorporations of uaas for protein research and 
laboratory evolution, the RF1 knockout strains can also be valuable for investi- 
gating the evolution of the genetic code [110]. 

Due to the degenerate nature of the genetic code, reassigning sense codons to 
encode uaas is also conceivable, once the specific codons are successfully elimi- 
nated from the genome. In a proof-of-concept work, the synonymous re-coding 
of certain Ser, Leu, or Ala codons was attempted in a 20 kbp-long essential operon 
of E. coli MDS42 [16]. Eight different re-coding schemes were tested, some of 
which resulted in the exchange of 373 codons in a single step. Measuring the 
efficiency of various codon exchanges permitted the definition of allowed and 
disallowed synonymous re-coding schemes to be applied in future codon reas- 
signment projects. 
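
The idea of a synonymous re-coding scheme can be made concrete with a small sketch. The scheme below swaps two serine codons for synonymous ones (TCG to AGC and TCA to AGT). It is a made-up example, not one of the eight schemes tested in the cited work, and the toy ORF and function names are likewise hypothetical. The sketch counts the exchanged codons and checks that the encoded protein is unchanged under the standard genetic code.

```python
# Hypothetical sketch: the recoding scheme and the toy ORF are invented for illustration.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(orf):
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))


def recode(orf, scheme):
    """Apply a codon -> codon mapping in frame; return (recoded_orf, codons_changed)."""
    if len(orf) % 3:
        raise ValueError("ORF length must be a multiple of 3")
    for old, new in scheme.items():
        if CODON_TABLE[old] != CODON_TABLE[new]:
            raise ValueError(f"{old}->{new} is not a synonymous exchange")
    codons = [orf[i:i + 3] for i in range(0, len(orf), 3)]
    recoded = [scheme.get(codon, codon) for codon in codons]
    changed = sum(old != new for old, new in zip(codons, recoded))
    return "".join(recoded), changed


scheme = {"TCG": "AGC", "TCA": "AGT"}   # Ser -> Ser exchanges (assumed example)
orf = "ATGTCGAAATCATAA"                 # toy ORF encoding M-S-K-S-stop
new_orf, n_changed = recode(orf, scheme)
print(new_orf, n_changed, translate(new_orf) == translate(orf))
```

An unchanged protein sequence is necessary but not sufficient for a tolerated scheme: as noted above, the efficiency of each codon exchange had to be measured experimentally to separate allowed from disallowed schemes.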

A similar project aims, in the long run, at re-coding the complete
E. coli MDS42 genome to eliminate all 62 214 instances of seven different codons
[111]. In this endeavor, re-coding would take place by the stepwise exchange of
50 kbp-long segments of the chromosome. When the complementing ability of
the synthetic recoded DNA segments was tested one by one, 99.5% of the recoded genes
were found to complement their wild-type counterparts without the need for further
optimization. The use of the MDS42 strain in such re-coding enterprises
promises reduced synthesis costs and improved genome stability.


4.8.5 Genome Architecture 


As reduced-genome bacteria have altered positions and perturbed local context 
of certain chromosomal segments, these strains could be useful for studying 
genome architectural effects. A comparative protein occupancy profile of 
MG1655 and MDS42 was analyzed using microarray-based chromatin immuno- 
precipitation [112]. This work identified both highly transcribed and transcrip- 
tionally silent extended protein occupancy domains, HiEPODs and tsEPODs, 
respectively. It was suggested that the binding of tsEPODs by nucleoid proteins 
(HU, Fis, H-NS, and IHF) establishes them as chromosomal organizing centers. 
MDS42 lacks a large fraction of tsEPODs, but the remaining ones are similarly 
located as in parental MG1655, supporting a dynamic role of the organizing 
centers in the formation of a higher-order chromosome structure. 




4.9 Concluding Remarks, Challenges, and Future 
Directions 


Streamlined-genome E. coli strains are representatives of a promising direction 
of synthetic biology research. The goals of cell simplification have already been 
partially fulfilled. Reduced complexity arising from elimination of redundant and 
unnecessary functions helped to elucidate hitherto unknown functions and net- 
work interactions. Increased phenotypic uniformity and genetic stability can be 
exploited for maintaining unstable synthetic constructs. Demonstration of 
increased amino acid production by reduced-genome cells may hint at improve- 
ments in cellular economy. 

Numerous applications, from bacterial computation and gene network 
model building to vaccine production and plasmid biopharmaceutical manu- 
facturing, have been suggested for streamlined-genome E. coli. However, most 
tangible applications to date were research oriented, and despite all the 
advances, published biotechnological applications of streamlined-genome cells 
were limited to a few pilot studies (Table 4.1). In order to attain a more wide- 
spread status as production hosts, simplified cells clearly need improvements, 
and superior performance over traditional production strains has to be
demonstrated. 

On one hand, construction of a superior chassis should involve not only 
streamlining but also extensive rational optimization. Introduction of muta- 
tions known to increase fitness or compensating for loss of certain genetic 
material could enhance performance. Advantageous features of different E. coli 
strains (e.g., the high recombinant protein production capability of BL21 and 
the easy genetic accessibility and high stability of K-12 MDS) could be com- 
bined in a single host [95]. New genome manipulation techniques are at hand to 
accelerate the optimization process. MAGE allows simultaneous, targeted 
introduction of small modifications at many genomic sites [121]. New DNA
cleavage tools (TALENs, CRISPR/Cas), tailored to target specific genomic sites,
might be used to devise novel schemes to rapidly perform various manipulations
[8, 122-124].


Table 4.1 Published biotechnology-related applications of streamlined-genome E. coli.

Application                                           References
Recombinant protein production                        [113-115]
Construction of lentiviral expression vectors         [99]
Enhanced L-threonine production                       [47, 116]
Stabilized maintenance of genetic constructs          [4, 5]
Expression of avian influenza virus gene              [117]
Dengue reporter virus constructions                   [118]
Periplasmic delivery of human interleukin-10          [119]
Investigation of antimicrobial peptide sensitivity    [120]
Construction of IS-free P1 phage                      [7]
Incorporation of unnatural amino acids in proteins    [110]

On the other hand, rational, targeted streamlining and optimization could be
complemented by random engineering coupled with directed evolution. Devising 
efficient, forced random deletion-creating schemes, applying cyclic multiplex 
genomic alteration techniques, or shuffling different genomes would vastly 
increase the number of genomic variants, from which the fittest versions could 
be identified by proper selection. 

The plummeting cost of DNA synthesis is continuously increasing the rele- 
vance and reality of synthesizing streamlined genomes. Originally, bottom-up 
synthesis and top-down reduction of genomes were viewed as two competing 
and opposite approaches to simplify bacterial cells. In current practice, these 
two strategies seem to complement each other harmoniously: reduced genomes
are used as starting points of complete genetic rewiring using synthetic DNA 
cassettes [16, 111], and deletion construction has also been demonstrated with 
plasmids carrying synthetic DNA fragments [74]. Furthermore, the boundary between
the two strategies is blurred ab ovo, for the gene sets of minimal genomes synthesized
to date are all subsets of the genetic repertoire of extant bacterial strains
[15]. It is possible, however, that in the future, minimal genomes will be synthe- 
sized by combining genes originating from multiple species. 

How far should genome size reduction extend? In general, the relatively small 
effects of the extensive genomic perturbations represented by streamlined 
genomes attest to a remarkable robustness of the cellular physiology and genome 
architecture. However, reduced complexity inevitably comes at the expense of 
robustness and adaptability to external factors [23]. Observations from practical 
genome streamlining works ([97] and our observations) also suggest that large- 
scale elimination of genes, while initially resulting in improvements, may reduce 
robustness and cause deterioration of basic cellular physiology (growth proper- 
ties, adaptability, nucleoid structure, cell morphology) beyond a certain point 
that roughly corresponds to the core-genome size (Figure 4.5). 

Beyond the complexity issue, physical constraints on genome size might 
also limit reduction efforts. Despite decades of research, little is known about the
homeostatic mechanisms coordinating DNA replication, transcription, and 
translation to maintain a constant DNA to cell mass ratio [125]. Significantly 
altering the genome size may perturb these mechanisms. Moreover, while our 
knowledge regarding gene and network functions is becoming ever more complete,
constraints of genome architecture per se are less understood [126].
The specific and relative localization of some genes (e.g., ribosomal RNA 
operons) and specific chromosomal sites (e.g., binding sites for proteins par- 
ticipating in cell division), superhelicity of the genome, and macro- and 
microdomain structure are all influenced in a largely unknown way by genome 
reduction. 

However, synthetic biology provides precisely the tools needed to address
these issues [9]: streamlined genomes can be specifically designed and constructed
to elucidate the constraints of genome size and architecture.

Figure 4.5 Hypothetical relationship between the fitness of the cell and the extent of genome
streamlining. 

A popular subject of hype, synthetic biology (SynBio), is described as a domain
that will solve many of the catastrophic consequences of the human demographic 
explosion. Yet, the vision that stems from the engineering stance of this avatar of 
biology is seldom emphasized [1]. SynBio is uncommonly fruitful because taking 
life as an engineer would allow us to invert the classic view, where structure 
predates function, by placing function first [2, 3]. Innovation is a built-in 
consequence of engineering because it commonly originates from a top-down 
approach. It is based on functional analysis [4, 5], a methodology that endeavors 
to uncover, list, and organize the needed functions before they are implemented 
in the design of a particular contraption. This chapter illustrates the constructive 
role of engineering with a sample of SynBio-related functions relevant to the 
architecture of the genetic program connected with its associated host cell. 
Following trends developed by other investigators involved in the development 
of SynBio [6], we hope that introducing the logic of engineering will spur novel 
types of studies that will, eventually, result in successful applications of SynBio 
and more generally develop the future of biology with a fresh mind-set. 


5.1 A Prerequisite to Synthetic Biology: An Engineering 
Definition of What Life Is 


Engineering has tight relationships with what we recognize as science, created in 
Greece some three millennia ago. To see how it contributes to conceptual devel- 
opments in our contemporary understanding of life, let us briefly recapitulate 
how engineering was associated with the history of science [7]. While inventing
writing, our predecessors began to organize the world they lived in by making
inventories: herds of animals, bushels of grain, and stars in the heavens. The
outcome of this effort had to be organized so that the corresponding knowledge
could be retrieved and put to best use, when and where needed. Maps of the sky,
of the land, and of the city were written down, drawn, and discussed. This led some
to notice recurrent events within the records. Using repeated observations as
markers led our forefathers to derive useful applications, in computation and in
understanding and predicting how the world would fare. In parallel and in a
reciprocal activity, making tools, buildings, and machines resulted from accumu- 
lated understanding of repeated features (e.g., in using metals, in particular, iron, 
at some point). In turn, this brought forth further understanding via all kinds of 
explorations, both in the concrete world and in abstraction. Engineering allowed 
those who ruled the city to tag events in time (including the regularity of days 
and seasons) and to measure the flow of time, in parallel with measuring
positions in maps and lengths in space. The way we collect huge amounts of data
today is not unlike the situation in this ancient era, which makes our time,
again, well suited to the development of engineering.

Still, this activity was applied essentially to inanimate objects. Apart from 
the domestication of plants and animals, life mostly escaped the engineer’s 
hands because it was so natural: it appeared everywhere, independent of man.
Spontaneous generation was not the exception; it was the rule. You just had to
let a broth stand in the air to see it lose its transparency and become full of
worms. The consequence is that it took a very long time to consider that life could
also be open to engineering. The reluctance of some to accept SynBio today is a
continuation of this prescientific attitude: after all, rational plant breeding is
at most two centuries old [8], and we still witness sequels of the ancient idea
that the moon directly influences plant growth (see [9] for reference). Pasteur
discovered that life was associated with dissymmetry. This led scientists to begin
to see biology as a particular development of chemistry, at a time when the
frontier between organic and inorganic matter had begun to vanish. La dissymétrie,
c’est la vie (dissymmetry, this is life!) declared Pasteur. Indeed this claim
pointed out the existence of some efficient
and somehow easy selection process that would trap and carry over, within living 
organisms, some of the physical dissymmetry present in the universe. Pasteur 
remained a vitalist, but Justus Liebig and Claude Bernard, each one in his own 
way, propagated the idea that chemical processes were at the root of life. In short, 
their works asked for some definition of life that would be useful for an engineer 
wishing to (re)construct a living entity. It also, unobtrusively, pointed out a role 
for information, an overlooked currency of reality. It is high time today to put 
biology in the light of engineering. 

The most successful engineering paradigm of cells and organisms is that they 
behave as machines running a program [10-14]. The basis for genetic engi- 
neering has been the development of techniques that allow investigators to 
synthesize pieces of genetic programs meant, oftentimes, to express genes into 
proteins of industrial interest [15]. In an early work, James Danielli saw that 
engineering could extend to the synthesis of life by combining individual bits 
and pieces into a functional entity [16, 17]. Despite this deconstruction/recon- 
struction procedure, it was long asserted that machine and program were inti- 
mately linked together and inseparable (see [18] for a justification of this 
negative view). 




The onset of DNA-based technologies such as transfection of viral DNA into
cells [19] and genetic engineering [15], together with the recognition that
horizontal gene transfer accounts for a considerable fraction of bacterial genomes
[20], and finally whole genome transplantation [21], was a turning point. These
advances established that the machine and the program are indeed separate entities,
exactly as the operating system (OS) and the computer can be physically told apart
[10, 12]. It has now become possible to synthesize viral genomes in such a way that
they comply with a man-made design [22, 23]. The statement found in rearguard
discussions, that the comparison between cells and computers is not valid because
there is a considerable amount of information in the cell besides its genome,
cannot be retained as a final argument. Indeed, the situation is exactly the same in
computers, human artifacts that fare well. Nobody would argue that the tablet or
the PC does not carry a considerable body of information. The proof is that a CD
carrying an OS is useless in the absence of the information carried by the machine
that runs it. Yet nobody would argue against the fact that computers work,
provided they can read a medium carrying a matching OS.

Of course, this is not the whole story: besides program and machine (the
“chassis” of SynBio specialists [24]), the cell, like the computer, needs to process
energy, a feature that is not implemented in the abstract ancestor of the computer,
the Turing machine. Furthermore, there is a need for construction and maintenance,
which implies fluxes of matter, a currency of reality that is also absent from the 
purely informational Turing machine. In living organisms these essential func- 
tions are fulfilled by metabolism. Life can be witnessed only when metabolic 
fluxes can be measured, with “dormancy” labeling the limbo between life and 
death. In summary, life combines a program, a machine reading and expressing 
the program, and a metabolism managing matter and energy fluxes to run the 
program in the machine. Finally, a living organism works through an ultimate 
constraint: it must produce a progeny. Functions pertaining to that particular 
process make the core of the present chapter. Using functional analysis to under- 
stand the making of life, with emphasis on the processes just summarized, we 
propose here a set of developments that emphasize the mutual interaction 
between the program and the chassis, a setup essential to master for the future of 
SynBio. 


5.2 Functional Analysis: Master Function and Helper 
Functions 


The success of genome transplantation into recipient hosts, the founding
experiment of next-generation SynBio [25], is allowing scholars to look at biology
with new eyes. To go further, we apply here the agenda of functional analysis
[4, 5] to cells considered as “machines” or “automata,” where a program can be
explicitly told from the machine that runs it (Figure 5.1). When trying to
understand how an organism can be fit for a particular niche, we first split its
biological functions into two functional categories: at least one master function,
and associated helper functions meant to achieve the target of the master function [26].

Figure 5.1 A schematic view of functional analysis [5]. Master and helper functions are as
defined in the text.


To take a human artifact as an illustration, the master function of a printer would
be to print documents. The associated helper functions would be supplying paper,
ink, electric power, and so on. Other helper functions would correspond to the
design of the printer’s chassis. When cells are envisioned as factories, their
designed master function is the production of some compounds. However, this is
entirely dependent on the ability of cells to multiply while replicating both their
own program and the proper SynBio program construct, thus yoking the human
construct to the cell’s master function (multiplying).

While it is somewhat difficult to identify them without ambiguity, living
organisms appear to display two intertwined master functions. The most obvious one
is “to make a progeny.” A myriad of helper functions have evolved to allow this
master function to operate, and the huge variety of living organisms reflects this
situation. This perspective (master function/helper functions split), however,
remains fairly open. For most (this is the common view), “propagating life” is the
destination of life. However, we must consider an alternate view, where
“exploration” would be the master function, with “propagating life” as the
immediately downstream helper function to that particular master function. Life
would thus be a particular physicochemical process carrying further the intrinsic
propensity for exploration shared by all entities present in the universe (following
the second law of thermodynamics, which states that physical systems tend to
occupy as many space and energy states as they can). Here, we favor the first
choice, and we avoid innovation (a major consequence of exploration) as a core
property of SynBio constructs: who would like to fly in a plane that could modify
its wings and engines in flight? We consider in what follows that the most general
goal of SynBio is to make a reproducible automaton meant to produce
compounds of preset design. This ranks exploration as a helper function that
generally must be placed under command in SynBio constructs, and possibly even
totally inactivated. We note, however, that our approach is operational and not
directly linked to the concept of fitness, which would entail a complementary
discussion [27]. It is likely that the next decade will witness hot debates in this
domain.


5.3 A Life-Specific Master Function: Building Up a Progeny


Life perpetuates itself. A sterile organism may still be alive, but it misses a key 
property of life in that it does not have a progeny. Indeed, it merely lives on
borrowed time: maintenance of a machine linked to a program, both
doomed to age and die, can hardly allow long-term survival in an ever-changing
environment (for a discussion, see [28] and references therein). Some animal
societies have classes of sterile individuals, but they are always firmly connected 
to a fertile lineage. If life were only composed of infertile individuals, it would 
already be extinct, unless there existed a steady and speedy process of spontane- 
ous generation with a creation time shorter than the life-span of individual 
organisms. This is more than unlikely and does not, anyhow, fit with the chemis- 
try of life as we know it. We will therefore accept that life is tightly coupled to the 
making of a (young) progeny. 

Considering this process, we can see that the ultimate destination of the genetic 
program is to make a copy of itself within a copy of the machine that runs the 
program. “Copy” here must be defined. How are the processes of program copying
and cell copying linked together? Remarkably, the concrete copying process differs
depending on whether it deals with the program or with the machine:
the program is replicated in most of the cell’s progeny (i.e., it makes exact copies
of itself), while the machine’s future is much sloppier, in that it is only
reproduced (i.e., it makes similar copies of itself) [27, 29]. Two time scales are
associated with this dichotomy: replication is trustworthy for many generations,
while reproduction makes copies that vary rapidly over time. Genome transplantation
experiments, such as those using synthetic genomes [25], give us a vivid illustra- 
tion of this functional dichotomy. Extracted at the end of the experiment and 
sequenced, the synthetic genome of the bacteria in the recovered colonies is 
identical to that which has been transplanted in the host. By contrast, the 
machinery, and even the cell’s shape, differs in the initial host and in the cells 
making the final colonies (Figure 5.2). In terms of engineering, this is somewhat
unusual, although we all know of man-made devices that have been progressively
modified, as was Theseus’s boat (which, after some time, did not keep a single one
of its original boards [11]). The parent machine has aged, and its components have
been replaced by new ones. In the transplantation experiment, this regeneration 
process required the use of a new program, differing from the parent one that 
had been destroyed, thanks to an astute genetic design [21]. As a consequence, 
during multiplication, the program that was used is that of the transplanted 
genome, directing the synthesis of entities that differ from those of the initial 
host machine. 
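
The dichotomy between replication of the program and reproduction of the machine can be made concrete with a small toy model. The sketch below is only an illustration under stated assumptions: the `Cell` class, the `divide` and `lineage` functions, the halving rule at division, and the component names and copy numbers are all hypothetical and are not taken from the transplantation experiments. The genome is copied symbol for symbol, while the machinery is diluted at each division and topped up according to what is currently being expressed.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    genome: str       # the program: copied symbol for symbol
    machinery: dict   # the chassis: component name -> copy number


def divide(parent, expressed):
    """Return one daughter cell (toy rule, an assumption of this sketch).

    The genome is replicated exactly. The machinery is only reproduced:
    each inherited component is halved (dilution at division) and new
    copies are added for whatever is currently expressed.
    """
    daughter_machinery = {}
    for name in sorted(set(parent.machinery) | set(expressed)):
        inherited = parent.machinery.get(name, 0) // 2
        resynthesized = expressed.get(name, 0)
        count = inherited + resynthesized
        if count > 0:
            daughter_machinery[name] = count
    return Cell(genome=parent.genome, machinery=daughter_machinery)


def lineage(founder, expression_history):
    """Follow a single line of descent, one division per expression state."""
    cells = [founder]
    for expressed in expression_history:
        cells.append(divide(cells[-1], expressed))
    return cells


if __name__ == "__main__":
    # Hypothetical recipient chassis holding 8 copies of a host component,
    # now run by a program that only directs synthesis of a new component.
    founder = Cell(genome="ATGGCTAAATAA", machinery={"host_component": 8})
    for generation, cell in enumerate(lineage(founder, [{"new_component": 4}] * 5)):
        print(generation, cell.genome, cell.machinery)
```

In this toy lineage the genome string never changes, whereas the host component is progressively diluted out and replaced by the component specified by the program in use, which is the pattern summarized in Figure 5.2.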

This state of affairs is far more general than that in the transplantation
experiment: as in any life form, the components of any SynBio construct age and are
replaced; in parallel, the environment changes and some components are no
longer required and are diluted out while others are expressed. In short, while
the program may remain the same, the machine that runs it is quite variable. It
keeps, however, its main functional (abstract) properties: reading and expressing
the program, and directing the construction of a progeny, while monitoring the
state of the environment, extracting proper resources, and discarding useless or
worn-out components. The relationship between the machine and the program
is central to this essential interaction. This situation is also common in
contemporary computers, which remember our past actions and do not behave today as
they did some time ago, improving their adaptation to our wishes as time elapses.
In cells, this corresponds to exploiting information that is not directly present
in the genes but is, rather, contextual: it lies in the way genes are placed (and
sometimes tagged by specific biochemical processes) in the genome, in the
disposition of the genome within the cell, and in the ultimate matter making up the
genome. We note in passing that the founding transplantation experiment tells
us something more in terms of functions. It uncovers a first hidden functional
constraint on genome structure: the chromosome needs to be compacted to
fit a small volume, and this is why (Figure 5.2) the transplantation experiment
requires as a first step the making of a syncytium to accommodate a decondensed
DNA molecule [27].

Figure 5.2 Replication of the program, reproduction of the chassis. The sequence of the
Mycoplasma mycoides DNA transplanted into Mycoplasma capricolum is identical to that at the
end of the experiment. Transplantation has triggered degradation of the M. capricolum
DNA, while the DNA of M. mycoides replicates and dilutes out the components of the initial
host. At the end of the experiment, the components of the cells are identical to those of
M. mycoides, not to those of the recipient M. capricolum.


5.4 Helper Functions


For life to keep going and to leave descendants, a host of helper functions
is needed. These functions operate at different levels. They are organized along
hierarchies that are segmented (like organs in an animal body) or branched
(like trunk, branches, and leaves of a tree). To take this hierarchical view into
account, the functional part of the gene ontology effort has endeavored to
organize functional data as well as structural data [30]. The outcome of this
remarkable effort still needs considerable improvement, and it is likely that
much innovation will appear there in the near future [31].

To name just a few helper functions, here are excerpts from an unlimited list
(one way to go further and uncover unexpected functions would be to work through
all the verbs present in a particular language):


Making a progeny, with associated functions: 

Construction of biomass 

Replication of the program 

Division (separating the progeny from the parent organism) 

Maintenance 

Making the progeny young, that is, separating between young and aged entities 


Exploration is the function that could also be considered as a master function
for all living organisms, as life is doomed to explore its environment. It implies
either harnessing movements of the environment (spores or seeds propagated by
wind), constructing appendages allowing the organism to move (flagella in
microbes, limbs in animals), or harnessing features of the environment to guide
movements (light with phycobilisomes, magnetic field with magnetosomes, etc.).

Each one of these functions is achieved using a lower level of helper functions, 
some of them universally required for replication and reproduction, while others 
are used by the organism for moving and occupying a particular niche [32]: 


Transport (in and out): Extraction of chemical compounds from the environment, 
and getting rid of waste 

Circulation 

Sensing 

Management of energy 

Storing 

Shaping and maintenance of the cell structures 

Degradation/resynthesis 

Protection 
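
The two lists above can be written down as a small data structure, which makes the hierarchical reading of functional analysis explicit. This is a minimal sketch, not a formal ontology: the `Function` class and the helper names `build_progeny_hierarchy` and `unique_function_names` are hypothetical, and linking every lower-level helper to every upper-level function is a placeholder assumption, since the text does not say which lower-level helper serves which function.

```python
from dataclasses import dataclass, field


@dataclass
class Function:
    name: str
    helpers: list = field(default_factory=list)

    def walk(self, depth=0):
        """Yield (depth, name) pairs, each function before its helpers."""
        yield depth, self.name
        for helper in self.helpers:
            yield from helper.walk(depth + 1)


def build_progeny_hierarchy():
    """Build the master function and the two helper levels named in the text."""
    lower = [Function(name) for name in (
        "Transport (in and out)",
        "Circulation",
        "Sensing",
        "Management of energy",
        "Storing",
        "Shaping and maintenance of the cell structures",
        "Degradation/resynthesis",
        "Protection",
    )]
    # Assumption: the same lower-level helpers are shared by every
    # upper-level function (the list object is deliberately reused).
    upper = [Function(name, helpers=lower) for name in (
        "Construction of biomass",
        "Replication of the program",
        "Division",
        "Maintenance",
        "Making the progeny young",
    )]
    return Function("Making a progeny", helpers=upper)


def unique_function_names(root):
    """Collect each function name once, even when helpers are shared."""
    return {name for _, name in root.walk()}


if __name__ == "__main__":
    root = build_progeny_hierarchy()
    for depth, name in root.walk():
        if depth < 2:  # print only the master function and its direct helpers
            print("  " * depth + name)
    print(len(unique_function_names(root)), "distinct functions")
```

Because helpers are shared, the structure is a branched graph rather than a strict tree. A segmented hierarchy, by contrast, would give each upper-level function its own private list of helpers.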


To make this analysis explicit and concrete, we now explore some of the topics
that are central to placing the genome in the cell’s context. Let us split some of
these helper functions according to the dominant contribution of each of the five
universal currencies of reality: matter, energy, space, time, and information, with
emphasis on constraints on the genome (including its assembly).


5.4.1 Matter: Building Blocks and Structures (with Emphasis on DNA) 


Formation of a cell begins with metabolism. For decades, this topic was
regarded as a boring, haphazard collection of chemicals. It was taught in classes
where students tried to learn by heart lists of compounds that were not really
understood as following much logic. Yet, there is a clear logic of metabolism that
is beginning to be deciphered [33]. This logic is consistent in terms of the physics
of matter. However, the local stability and relevance of retained pathways is not
optimized in terms of what would be a human engineering design. This implies that
optimization, in terms of sparing energy and matter, will be at the core of
next-decade metabolic engineering. Describing this logic would require a whole
textbook, and we will just enumerate here some of the rules that are beginning to
emerge. Because DNA is the material support of the genetic program, its synthesis
from nucleotides, required for replication of the genome, will be described in some
detail.

The atoms of life are not random: carbon, hydrogen, nitrogen, and oxygen
compose living organisms because they are prone to combine via chains of covalent
bonds that are stable at the temperature of the Earth’s surface; heavier
elements would not retain this property in general. Sulfur (together with iron) is
added to the list because of constraints well understood in scenarios of the origin
of life [34]. This atom is also a remarkably versatile support for electron transfers.
It exists in biological compounds in redox states going from −2 to +6, and this
useful property explains why it has been retained in the course of evolution [35].
Phosphorus is unique when combined with oxygen, as phosphate bonds are prone
to hydrolyze (hence easy to disrupt in water), yet metastable (hydrolysis
generally requires a large activation energy). This property, which allows phosphate
compounds to store energy, is the reason phosphorus belongs to the core atoms
of life on Earth [36]. This constraint is essential to remember when looking for
xenologous BioBricks meant to construct genetic programs for xenobiology. The
role of phosphates provides us with a strong argument in favor of the engineering
stance. Had the way engineers think been favored, arsenic would never have been
considered as a substitute for phosphorus [37].

Phosphorus is a core component of nucleic acids, enabling a specific metabolic 
driving force associated with hydrolysis of pyrophosphate (polymerization of 
nucleotides is reversible; therefore going forward requires an irreversible step). 
Furthermore, the organization of phosphate metabolism drives the nucleotide
composition of the genome in a way that is not yet completely understood.
Indeed, deoxyribonucleotides are essentially synthesized from the ribonucleo- 
side diphosphates, not triphosphates. This constraint is likely derived from the 
selection pressure that uses a metabolism developed in the three-dimensional 
environment, for synthesis of a linear molecule. This has a remarkable conse- 
quence for pyrimidines, as their anabolic pathway produces uridine diphosphate 
(UDP), but not cytidine diphosphate (CDP). This should lead to deoxyuridine 
diphosphate (dUDP) and then deoxyuridine triphosphate (dUTP), while input of 
U in DNA must be avoided at all costs via a complex set of pathways. Missing 
CDP would require an indirect process to make deoxycytidine diphosphate 
(dCDP) and then deoxycytidine triphosphate (dCTP) [38] (Figure 5.3). The con- 
sequence of this imbalance is that, in most cases, the genetic program tends to 
be progressively enriched in A+T nucleotides [39, 40]. The degradosome (with
its exosome counterpart in Eukarya) is the machinery that resolves this hurdle.
It allows buffering and equilibration of nucleic acid composition via degradation
of RNAs by phosphorolysis (directly producing the much-needed nucleoside
diphosphates (NDPs), in particular CDP, that are the precursors required
for DNA replication) while coupling the fluxes of nucleotides with energy
resources [38, 39, 41]. Furthermore, the physical relationship between phosphate
metabolism, replication, and transcription is likely to have considerable bearing
on the genome organization within the cell (it is at the root of the preservation
of a nucleus in eukaryotes).

Figure 5.3 Excerpt of the metabolism of pyrimidines and DNA synthesis. The building blocks
for DNA stem from NDPs, not nucleoside triphosphates (NTPs). This creates an imbalance in
the case of cytosine, because CDP is not produced during the de novo synthesis. This explains
why, in general, C is the limiting nucleotide, driving A+T enrichment of the genome in most
situations. Nucleoside diphosphokinase is reversible; however ATP is in excess over adenosine
diphosphate (ADP), so that production of CDP is limiting via this route. CDP comes mainly
from mRNA turnover via phosphorolysis (polynucleotide phosphorylase) or RNase activity,
with further phosphorylation using cytidylate kinase.
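
The progressive A+T enrichment described above can be illustrated with a deliberately crude two-state model. The sketch below is an assumption made for illustration, not a model taken from the cited work: each site is either A/T or G/C, the limiting supply of C is represented only by a G/C to A/T substitution rate (`gc_to_at`) larger than the reverse rate (`at_to_gc`), and the function names and numerical rates are hypothetical.

```python
def at_fraction_trajectory(at0, gc_to_at, at_to_gc, generations):
    """Iterate the A+T fraction of a genome under biased substitution.

    at0       -- starting A+T fraction, between 0 and 1
    gc_to_at  -- per-generation probability that a G/C site becomes A/T
    at_to_gc  -- per-generation probability that an A/T site becomes G/C
    """
    if not 0.0 <= at0 <= 1.0:
        raise ValueError("at0 must lie between 0 and 1")
    at = at0
    trajectory = [at]
    for _ in range(generations):
        # Gain from G/C sites converted to A/T, loss from the reverse change.
        at = at + gc_to_at * (1.0 - at) - at_to_gc * at
        trajectory.append(at)
    return trajectory


def equilibrium_at_fraction(gc_to_at, at_to_gc):
    """Fixed point of the recurrence: gains and losses balance."""
    return gc_to_at / (gc_to_at + at_to_gc)


if __name__ == "__main__":
    # Hypothetical rates: G/C -> A/T changes are twice as likely as the reverse.
    trajectory = at_fraction_trajectory(0.5, 2e-3, 1e-3, 20000)
    print(round(trajectory[-1], 4), round(equilibrium_at_fraction(2e-3, 1e-3), 4))
```

In this toy model the composition drifts toward a value set only by the ratio of the two rates, whatever the starting composition. Any process that restores the supply of CDP, such as the RNA turnover discussed in the text, would correspond here to lowering `gc_to_at` relative to `at_to_gc`.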

The genome backbone phosphate is not strictly universal. Some organisms
use a variant where the usual phosphate group is phosphorothioated [42]. This
modification, which can be used for specific recognition/folding processes and
provides the cells with protection against oxidative stress [43], is a first hint that
SynBio could evolve toward xenobiology (i.e., the use of nonstandard building
blocks for the construction of synthetic cells [44]). A further indication of this
possibility is the presence of diaminopurine instead of adenine in some
cyanophages [45]. Finally, de Crécy-Lagard and coworkers showed that a
7-deazapurine derivative can replace guanine in functional DNA [46]. Knowing that DNA
methylation can be used to control gene expression [47], the idea that other
modifications may have a similar role is straightforward. A systematic analysis of
the distribution of phosphorothioated sites or of 7-deazaguanine incorporation has
not yet been undertaken, and the role of these modifications in gene expression is
not known. In general, there is still considerable room for exploring the presence
and role of nucleic acid modifications [48].

Amino acids make the primary sequence of proteins, while many more exist in 
metabolic pathways (be it only as the result of catabolism of posttranslationally 
modified proteins). Proteinogenic amino acids are far from random, however, as 
several are fairly easy to synthesize (the smallest ones) and can be converted into 
one another at low energy cost [49], while they fall into three major
physicochemical classes highly relevant to water as a universal, albeit physically
unusual, solvent, essential for the development of life as we know it [50]:
proteinogenic amino acids are hydrophilic, amphiphilic, or hydrophobic. Surprisingly,
proline is not an amino acid but an imino acid, and this has considerable
consequences for translation, including the requirement for a specific elongation
factor [51]. As building
blocks of proteins, these molecules need to be activated as aminoacyl adenylates 
and loaded onto the 3’OH extremity of the ribose of an acceptor tRNA molecule. 
This process introduces considerable constraints in the selection of amino acids 
relevant to translation: for example, ornithine, homoserine, and homocysteine 
will cyclize during the process and create toxic dead-end compounds [35, 52]. 
Norleucine and selenomethionine can substitute for methionine [53], and this 
changes activity only in a restricted number of proteins but is deleterious when 
methionine is limiting as the methionine side chain is specifically used in some 
metalloproteins, for example [54]. 2-Aminobutyrate is an analog of cysteine, and 
its concentration must be stringently controlled as it mimics cysteine metabolism. 
Furthermore, the presence of non-proteinogenic amino acids in cells implies that
they are prone to impair translation accuracy through erroneous incorporation
into proteins. Hence the cell must cope with this hurdle either by maintaining
a very low level of non-proteinogenic amino acids or by modifying them
(generally by N-acylation and sometimes N-methylation) so that they do not
enter the wrong pathway [55].

Interestingly, this ubiquitous protection pathway (in the sense given by engi- 
neers in organic chemistry) is likely to have been recruited for other helper 
functions such as further protection or regulation. For example, ribosomal pro- 
teins are generally acylated, but the exact function of the modifications remains 
unknown, except for an obvious coupling with metabolism and a protective 
role for reactive amines [56, 57]. In the case of nucleotides, it may well be that
formation of the triphosphates, besides providing a way to drive biosyntheses
forward via pyrophosphate hydrolysis, also plays the role of a recognition
group, resulting in the selection of a subset of nucleotides for insertion in
polynucleotides.

Another mode of metabolism organization derives from a distinctive match 
between matter and space constraints. Carbon chemistry allows formation of a 
considerable number of specific stereoisomers (remember Pasteur’s exclama- 
tion) that are recognized by enzyme cavities in a highly space-constrained way. 
Proteinogenic amino acids are of the L-type. As a consequence most proteases
and peptidases are active on chains made of these stereoisomers. This opened
up the possibility of a selection pressure, leading to protease-insensitive
protective structures that evolved toward containing the D-isomers (e.g., antibiotics
[58]). Another most important selection pressure is on compounds that are
similar to amino acids, with a hydroxyl group in the place of the alpha-amino group,
making them good mimics of amino acids. For example, glycerate is quite
similar to serine and could take its place in many enzymes, an unwanted, toxic
stereochemical property. D-Glycerate is therefore the preferred stereoisomer
[59]. This has consequences on the make-up of nucleic acids. Because of the link
between the latter metabolite and those involved in glycolysis/gluconeogenesis,
this stereochemical constraint explains why most biologically relevant
carbohydrates, ribose and deoxyribose in particular, are of the D-isomer type [60].
This metabolic constraint must be taken into account when planning to derive
novel nucleic acid analogs for next-generation SynBio.

At a more integrated level of the hierarchical organization of life, multicellular 
organisms have developed an extraordinary diversity of macromolecular materi- 
als that work as frames, protectants, buffers, motors, signals, traps, and so on. 
DNA itself is known to belong to the structural polyanionic polymers, as it is, for 
example, a component of biofilms [61], which introduces a fitness property that 
has nothing to do with its coding capacity. There is, in any case, a
significant selection pressure to increase its length in order to match its synthesis
with that of the bulk of the cell (see Section 5.4.3). Exploration of chemical
diversity both in terms of small metabolites (see, e.g., [62-64]) and in terms of
macromolecules is expected to develop considerably (see, e.g., [65-67]) in the next
decade. The corresponding genetic program implementation within the cell will
need to be explored in depth. Here again thinking as an engineer will come as an 
asset for innovation. 

The list of engineering constraints on the matter used in living organisms is 
unlimited. The examples presented earlier are just meant to illustrate the way we 
should presently consider metabolism. In another dimension, metabolites of 
industrial interest, such as isobutene, are and will be produced by
reprogramming and setting up synthetic pathways [68]. This will entail production of
molecules that may react with existing cell components, including
DNA. To end up with high yields, metabolic engineering will need a deep
reflection on metabolite reactivity within the confined medium of the cell, a topic that
has mostly been restricted to the study of reactive oxygen species [69]. In
particular it seems obvious that the chromosome must be protected, as much as
possible, against reactive metabolic intermediates (we saw previously that
phosphorothioation is a solution uncovered during evolution). Management of waste
will also be a major topic to be developed (be it only to limit carbon dioxide
production).


5.4.2 Energy 


Management of energy is central to life. It has long been established to be associ- 
ated with electron and proton transfers and with storage as energy-rich phosphate 
bonds. The motto “better lose energy than control” seems to dominate life. Much 
is known about the energy-related processes, but much also remains to be
understood in terms of optimization. For example, although polyphosphates are
ubiquitous (all cells contain these compounds) and important for energy
management, regulation, and storage [70-72], their role has seldom been considered.
It seems likely that their contribution as the ultimate energy source
(polyphosphates are minerals, hence particularly resistant to desiccation, radiation, and
harsh environments) needs to be reconsidered, especially in terms of synthesis 
and usage during transition states, aging, and stresses [28]. Nucleic acids are also 
energy-rich molecules. This implies that some regions of DNA might in fact 
have a role as energy stores, besides their expected role in space management, 
gene coding, and regulation. In this respect, some organisms have a genome of 
huge size [73], without any apparent direct link with its coding capacity.

Novel features of electron transfers are related to the way protons and
electrons can be transported within and between cells. A considerable effort has yet
to be devoted to the production of hydrogen as an energy store [74-76]. Research
on microbial fuel cells is expected to develop dramatically [77-79]. It has now 
been recognized that cells can make wires that conduct electricity, creating an 
entirely new field for management and extraction of energy via living systems 
[80-82]. Some cells make large syncytia, which requires management of the 
genome DNA and energy sources (polyphosphate in particular) in a way that is 
not yet understood [71]. The role of membranes is essential in building up and 
maintaining the electrochemical potential of the cell via vectorial transport. In 
the same way energy is stored in a variety of polymer compounds such as lipid 
droplets, carbohydrate polymers, polyphosphates, and so on. This introduces a 
specific link between energy and space, the role of which we now discuss. 


5.4.3 Managing Space


In the cell, the three dimensions of space play together in a concerted fashion. 
The genetic program is stored by a molecule of DNA that can be considered as 
linear in the way it maintains its coding capacity; membranes organize space in 
two dimensions; finally the interior of the cell is three-dimensional. In terms of 
biosyntheses this has consequences that are seldom taken into account. Filling 
up the cytoplasm with proteins as the cell grows requires an increase as the cube 
of the cell’s size (if the cell is spherical, less when it is of another shape) while 
placing proteins in the membrane would go as the square of the cell’s size. This 
discrepancy introduces a considerable constraint on the length of the genome. It 
cannot be too short, which implies that despite a selective tendency to stream- 
line the genome sequence because of the cost to maintain functional genes, there 
is an opposite tendency to fill it in with extra DNA sequences. Amplification of 
insertion sequences or similar structures and horizontal gene transfer can com- 
pensate for deletions. Overall insertions and deletions create an equilibrium that 
results in an optimum length, where the DNA length is considerably longer than 
that of the cell. Indeed, 4’,6-diamidino-2-phenylindole (DAPI) staining shows 
that the genome in itself occupies a significant proportion of the cell’s volume 
[83], shaping it more like a three-dimensional structure, via folding into a Peano 
curve-like space-filling setup [84]. This constraint is likely to be important in the 
gene flow that maintains a particular genome length [85]. 
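
The scaling argument above can be made concrete with a short numerical sketch. It assumes an idealized spherical cell, and the function names are hypothetical; it only illustrates how the cytoplasmic volume to be filled outpaces the membrane area as the radius grows.

```python
import math

def sphere_volume(radius_um: float) -> float:
    """Volume of an idealized spherical cell (cubic micrometres)."""
    return 4.0 / 3.0 * math.pi * radius_um ** 3

def sphere_area(radius_um: float) -> float:
    """Membrane area of the same idealized cell (square micrometres)."""
    return 4.0 * math.pi * radius_um ** 2

def volume_to_area_ratio(radius_um: float) -> float:
    # For a sphere this ratio equals radius / 3, so it grows linearly with size.
    return sphere_volume(radius_um) / sphere_area(radius_um)

# Illustrative radii only (assumed values, not measurements).
for r in (0.5, 1.0, 2.0):
    print(f"radius {r} um: volume {sphere_volume(r):.2f} um^3, "
          f"area {sphere_area(r):.2f} um^2, V/A {volume_to_area_ratio(r):.3f} um")
```

Doubling the radius multiplies the volume by eight but the membrane area only by four, which is the discrepancy the text refers to.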

Chromosome DNA folds can be classified into three categories [86]: short 
range, of up to 16 kb (fitting with the local bias in codon usage [87]); medium
range, over 100-125 kb (fitting with old observations of supercoiled DNA loops 
upon mild cell lysis [88]); and long range, over 600-800 kb (fitting with the size 
of the shortest bacterial chromosomes and associated with macrodomains [89, 
90]). In Eukarya, the problem of the various space scales has been solved by the 
preservation of a nucleus accommodating the genome in a space much smaller in 
general than the size of the cell and multiplying membranes (in particular the 
endoplasmic reticulum) to couple protein synthesis with occupation of the cyto- 
plasmic space. Chromosome folding requirements appear to impose sequence 
constraints that create ubiquitous 11 bp periodic patterns, the “class A flexible
patterns” (previously identified as their core sequence, ApA dinucleotides
[91, 92]), spanning 5-10 helix turns, and present every few kilobases of the whole 
genome. These patterns have been repeatedly observed and are usually thought to
result from the local enrichment of ApA dinucleotides, but they are likely to result
from more complex patterns [93-95]. This constraint is so strong that it appears to
bias the nature of up to one in five nucleotides in the genome [95]. In general,
A-tracts have been related to DNA curvature, and they are expected to play a
considerable role in DNA compaction and regulation of gene expression [96-98].
The importance of this feature has not yet been explored in SynBio constructs. 
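
A designer who wishes to check a SynBio construct for such periodic signals could start from a simple spacing histogram. The sketch below is a minimal illustration, not the method used in the cited studies: it counts how often two ApA dinucleotides are separated by a given distance. The function name, the `max_lag` parameter, and the toy sequence are assumptions made for the example.

```python
from collections import Counter

def dinucleotide_spacing(seq: str, dinuc: str = "AA", max_lag: int = 30) -> Counter:
    """Count how often two occurrences of `dinuc` lie `lag` bases apart (1..max_lag)."""
    seq = seq.upper()
    positions = [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]
    spacing = Counter()
    for idx, p in enumerate(positions):
        for q in positions[idx + 1:]:
            lag = q - p
            if lag > max_lag:
                break  # positions are sorted, so later ones are even farther away
            spacing[lag] += 1
    return spacing

# Toy sequence (an assumption, not genomic data): one ApA every 11 bp.
toy = ("AA" + "C" * 9) * 20
counts = dinucleotide_spacing(toy)
print(counts.most_common(3))
```

On a real genome the histogram would be noisy, and a peak near 11 bp would have to be judged against a suitable background model.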

Managing space is also essential to organize gene expression (references in 
[84, 90, 99, 100]). Indeed, while the DNA molecule is a linear structure, the mem- 
brane is a 2D structure and the cytoplasm is a 3D structure; they all need to work 
in concert. Allowing coordination of the different space scales, gene expression 
[101] and distribution of genes within transcription units are finely tuned in 
most Bacteria and Archaea, in particular in terms of coordination of metabolic 
fluxes [84, 102]. For example, in the lactose operon, the gene for cytoplasmic 
beta-galactosidase, lacZ, is separated from that of the membrane protein lactose 
permease, lacY, by a regulatory transcription attenuator. This results in
considerably less expression of the distal genes lacY and lacA, as compared with that of
lacZ, and allows matching the production level of the cytoplasmic enzyme with 
that of the membrane transport protein [103]. In general there is a relationship 
between the genome organization and the pattern of transcripts and protein dis- 
tribution in the cell [86, 104]. 

The genome DNA is considerably longer than the cell itself, and this allows
folding of the chromosome in a way that can compensate for the one dimension/ 
three dimensions dichotomy. Furthermore there seems to exist a relationship 
between the overall cell architecture and that of the genome; Tamames and cow- 
orkers found a remarkable correlation between the distribution of genes in the 
mur-fts gene clusters and the overall shape of the cell [105, 106]. This observa- 
tion may fit with the view that transcripts are systematically distributed in spe- 
cific regions of the cell, as shown by local biases in codon usage, forming islands 
10-30 kb long [87] in agreement with the data reviewed by Willenbrock and
Ussery [100]. In general, analysis of the folding of the chromosome revealed the 
existence of a core structure linking together between 12 and 80 loops per chro- 
mosome [88, 107]. Many studies have explored the role of the distribution of the 
genes in the bacterial chromosome, in particular with the prospect of improving 
gene expression in biotechnological constructs (see [108] for further references). 
Despite the widespread view of the chromosome as extremely plastic, it rapidly 
appeared that while some regions were prone to harbor a variety of genes, others 
remained fairly constant. Indeed, macrodomain organization appears to display
rigid constraints that limit genome plasticity [109]. This was further illustrated 
with the comparison between a large number of Escherichia coli strains 
[110, 111]. It was also found that functionally related genes clustered together 
into islands in a way that should have considerable impact on gene expression 
[100, 108, 112]. 

The two extremes of gene distribution are clustering and its opposite, uniform
distribution (which creates an apparently periodic distribution, so that
noticing a period should not be taken as particularly significant). Steady random
insertion of genes via horizontal gene transfer will go toward creating a uniform 
distribution of the genes that are most important for the cell life, while frequent 
deletion will tend to make them cluster together [85]. Another well-identified 
constraint in the genome of fast-growing bacteria results from the fact that 
genes located near the origin of replication will tend to be in higher copy num- 
ber in the growing cell as compared with genes located near the terminus of 
replication [90]. This difference is also reflected in the distribution of codon
bias classes [87]. When long enough, the bacterial chromosome is further
organized into macrodomains that are insulated from one another and are 
essential for genome packaging [89, 113-115]. The presence of plasmids or sev- 
eral chromosomes alters this distribution [116]. Finally, there is a significant 
pressure for important genes to be transcribed from the leading replication 
strand in order to avoid transcription/replication conflicts [117]. Knowledge of 
these organization constraints is essential for optimizing gene placement in SynBio
constructs. 
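
The copy-number effect of gene position mentioned above can be estimated with the classical Cooper-Helmstetter description of replication in exponentially growing bacteria, which is brought in here as an illustration and is not taken from the chapter. In that model a gene at fractional position x between origin (x = 0) and terminus (x = 1) is present, on average, at 2^(C(1 - x)/tau) copies relative to a terminus-proximal gene, where C is the replication time and tau the doubling time. The function name and the parameter values below are assumptions chosen for the example.

```python
def relative_gene_dosage(position: float, c_period_min: float,
                         doubling_time_min: float) -> float:
    """Average copy number of a gene relative to a terminus-proximal gene.

    position: 0.0 at the origin of replication, 1.0 at the terminus.
    c_period_min: time needed to replicate the chromosome (minutes).
    doubling_time_min: culture doubling time (minutes).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie between 0 (origin) and 1 (terminus)")
    return 2.0 ** (c_period_min * (1.0 - position) / doubling_time_min)

# Assumed values for a fast-growing culture: C = 40 min, doubling time = 20 min.
for x in (0.0, 0.5, 1.0):
    print(f"position {x}: relative dosage {relative_gene_dosage(x, 40.0, 20.0):.2f}")
```

Under these assumed values, placing a gene next to the origin rather than the terminus changes its average dosage about fourfold, which is why placement matters for expression in fast-growing hosts.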

Management of space is further associated with several kinds of functional 
structures, exoskeleton and endoskeleton, scaffolds, and contractile proteins 
such as actins and myosins in the cytoplasm. How do the corresponding macro- 
molecules know where and when to go as the cell grows, changes its shape and 
eventually divides? In this context, it was revealing to discover that Bacteria and 
Archaea were not different from Eukarya, having a variety of structuring pro- 
teins, often associated with the inner membrane and contributing to the overall 
shape and functional properties of the cell [118]. As a common feature, the 
prokaryotic and eukaryotic cytoskeleton proteins couple energy consumption,
via adenosine triphosphate (ATP) and/or guanosine triphosphate (GTP) utilization,
to active (energy-requiring) mechanisms that effect structuring functions and
manage movements. The corresponding logic of engineering design has yet to be 
uncovered. A family of proteins, the structural maintenance of chromosome 
(SMC) proteins, manages chromosome spatial arrangement and replication, at 
the expense of energy [119]. As another example of versatile functional design, 
membrane protein topology is coupled to functional addressing, with recently 
recognized proteins with dual topology [120]. Interestingly, there is a coupling 
between genome evolution and these proteins: genes in families containing dual- 
topology candidates occur in genomes either as pairs or as singletons, and gene 
pairs encode two oppositely oriented proteins whereas singletons encode dual- 
topology candidates [121]. 

Finally, getting in and out of the cell is essential: the cell has to manage the 
influx of compounds used to construct biomass and create energy. It also has to
dispose of waste. These processes occur at the membrane, using a variety of
structures. Often, the cell has to extract useful compounds from an environment 
where they are considerably diluted. This requires an energy-dependent active 
transport that concentrates molecules up to a thousand-fold or more. This essen- 
tial engineering process has a trade-off: if the outside concentration of the com- 
pound increases suddenly, the influx will build up an unbearable osmotic 
pressure that will require coupling modification of the influx molecule and safety 
valves in order to prevent the breakup of the membrane [122, 123]. 
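
An order-of-magnitude feeling for this trade-off can be obtained from the van 't Hoff relation for ideal dilute solutions, osmotic pressure = c R T. The sketch below is only indicative: the cytoplasm is crowded and far from ideal, and the function name, the temperature, and the concentrations are assumptions chosen for the example.

```python
GAS_CONSTANT = 8.314462618   # J / (mol K)
PASCAL_PER_ATM = 101325.0

def osmotic_pressure_atm(concentration_molar: float, temperature_k: float = 310.15) -> float:
    """Ideal (van 't Hoff) osmotic pressure contributed by a solute, in atmospheres."""
    concentration_mol_per_m3 = concentration_molar * 1000.0
    return concentration_mol_per_m3 * GAS_CONSTANT * temperature_k / PASCAL_PER_ATM

accumulation = 1000.0  # assumed thousand-fold concentration by active transport
for outside_molar in (1e-5, 1e-3):  # assumed external concentrations: 10 uM, then 1 mM
    inside_molar = outside_molar * accumulation
    print(f"outside {outside_molar:g} M -> inside {inside_molar:g} M: "
          f"{osmotic_pressure_atm(inside_molar):.2f} atm")
```

With these assumed numbers, a hundred-fold jump in the external concentration raises the solute's contribution from a fraction of an atmosphere to roughly 25 atm if transport continues unchecked, which illustrates why modification of the imported molecule or safety valves are needed.
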
Another question that must be answered is the way the influx of protons is
distributed within the cell. As the membrane-associated rotor of ATP synthase
or the flagellar motor lets protons in at a fast rate, their influx must be coupled
to a steady average amount of “free” protons in the cell that is extremely low 
(typically, if there were such a pH as 7.6 in an E. coli cell, this would mean about 
15 free protons per cell at any time). The way protons are disposed of so that on 
average such a small number remains free is an entirely open question that 
requires understanding the way water is organized in the extremely crowded 
environment of the cytoplasm. This situation has considerable consequences in 
particular for highly charged molecules such as nucleic acids. This is not yet 
really understood [124, 125]. An alternative to safety valves is storage by polym- 
erization, a function fulfilled by a variety of structures and compartments [126], 
and polymerization of nucleotides is a way, rarely considered, to buffer osmotic 
pressure. 
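
The figure of about 15 free protons per cell follows directly from the definition of pH. The sketch below reproduces the arithmetic under one explicit assumption that is not stated in the text: a cell volume of about one femtolitre (1e-15 L), a commonly used round value for E. coli. The function name is hypothetical.

```python
AVOGADRO = 6.02214076e23  # particles per mole

def free_protons_per_cell(ph: float, cell_volume_litres: float = 1e-15) -> float:
    """Number of free protons implied by a given pH in a given volume."""
    concentration_molar = 10.0 ** (-ph)  # pH = -log10([H+])
    return concentration_molar * cell_volume_litres * AVOGADRO

print(round(free_protons_per_cell(7.6)))
```

Because the number is so small, the notion of a bulk pH inside a single cell is at best a time average, which is the point made in the text.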


5.4.4 Time


The idea that time and transitions are essential in shaping molecules and organ- 
izing cells is also central to the understanding of the addressing, organization 
and motion of proteins within the cell and its membrane. The role of time will be 
one of the most important features of the development of SynBio in the next 
decade. This is because in most research, studies of evolution and phylogeny 
aside, there has been a tendency to account for life in synchronous terms. For 
example, the recent descriptions of the way DNA is folded in cells provide us 
with a fairly static view [127-129]. Yet, it is obvious that except in dormant states, 
DNA is highly flexible and mobile, with movements triggered by transcription 
and all related processes that maintain supercoiling, as opening up the double 
helix locally will trigger a deformation that will propagate [127, 130, 131]. It is 
likely that the organization into macrodomains is fit to coordinate gene expres- 
sion [113], including when transcription involves time-dependent movements of 
the DNA template. 

Considering cells and organisms as computers making computers exposes a
considerable possible limitation, in which time plays a central role, resulting from
the fact that expression of the genetic program is highly parallel. Parallelism
implies that a variety of clocks allow synchronization of gene expression pro- 
cesses [27]. This need for synchronization is likely to be another constraint that 
organizes the genome into macrodomains [89]. Indeed, clocks are found every- 
where in life: coupling of gene expression with seasons [131], circadian rhythms 
[132], and many other kinds of clocks, unrelated to obvious environmental 
parameters [133]. It has been known since the nineteenth century that circuits 
with relevant feedback loops could end up with oscillating properties, de facto 
creating clocks. It is therefore quite trivial to find clocks based on regulatory 
gene expression circuits, an expected property that nevertheless became quite 
fashionable several decades ago, hiding more interesting roles of time. By con- 
trast and more interestingly, other intrinsic clocks, for example, based on the 
aging half-life of macromolecules (such as resulting from isomerization of aspar- 
agine and aspartate in proteins [27, 134]) may bring about unexpected uses of 
time. In general, the importance of time has been underestimated, in particular
because laboratory conditions are most often meant to provide steady-state 
invariable conditions. Time-dependent pattern formation, a basis for multicel- 
lular body plan, must be explored with novel approaches [135]. Finally, the role 
of ubiquitous transitions (shifts in temperature, light, metabolites supply, inter- 
actions with other organisms, and simply aging) will certainly need to be explored 
in much more depth for large-scale SynBio applications. The time scales of DNA
movements have not been explored in depth, and, if relevant, this missing
knowledge might become a limitation for the future of SynBio constructs. 
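
The remark that regulatory circuits with suitable feedback loops behave as clocks can be illustrated with a minimal simulation. The sketch below is a toy model and not a circuit from the cited literature: three repressors inhibit one another in a ring, each following dx/dt = alpha / (1 + r^n) - x, where r is the level of its repressor. The parameter values, the simple Euler integration, and the function names are assumptions chosen for the example.

```python
def simulate_ring_oscillator(alpha=20.0, hill=3.0, dt=0.01, steps=10000,
                             start=(1.0, 2.0, 3.0)):
    """Euler integration of three repressors, each inhibiting the next one in a ring.

    Returns the time course of the first repressor (dimensionless units).
    """
    x = list(start)
    trace = []
    for _ in range(steps):
        # Gene i is repressed by gene i-1; index -1 wraps around and closes the loop.
        dx = [alpha / (1.0 + x[i - 1] ** hill) - x[i] for i in range(3)]
        x = [x[i] + dt * dx[i] for i in range(3)]
        trace.append(x[0])
    return trace

def count_peaks(trace):
    """Number of local maxima in a sampled time course."""
    return sum(1 for a, b, c in zip(trace, trace[1:], trace[2:]) if a < b >= c)

trace = simulate_ring_oscillator()
late = trace[len(trace) // 2:]  # discard the initial transient
print("peaks:", count_peaks(late), "amplitude:", round(max(late) - min(late), 2))
```

The oscillation arises from nothing more than delayed negative feedback, which is why such clocks are an expected rather than a surprising property of regulatory circuits.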


5.4.5 Information 


SynBio uses cell factories that associate a program with a chassis. As previously 
discussed, the transplantation experiment that implemented a program that did 
not match the receiving host chassis [21] demonstrates the physical material 
separability between machine and program [10, 12]. It also emphasizes another 
point, where information is central: while, at the end of the experiment, the 
donor’s program is identical to that at the beginning, the final machine 
(Mycoplasma mycoides) differs from the initial host machine (Mycoplasma
capricolum) (Figure 5.2). This implies that some specific input of contextual
information (gene expression in a particular environment, at a particular time), and
not directly related to the information carried over by the genetic program, has 
been involved. In the same way, construction of a young progeny from aged cells 
demonstrates that there is a specific management of information by cells, in a 
way that is highly reminiscent of the way Maxwell’s demons operate [27, 136]. 
Briefly, creating a link between information and entropy, Maxwell introduced 
the idea of a hypothetical being, later seen as a “demon” that uses an in-built 
information-processing ability to reduce the entropy of a homogeneous gas (at a 
given temperature). The demon is able to measure the speed of gas molecules 
and open or close a door between two compartments as a function of the mole- 
cules’ speed, keeping them on one side if fast and on the other side if slow. This 
behavior will build up two compartments, one hot and one cold, reversing time 
and acting apparently against the second principle of thermodynamics. In the 
same way, proteins such as septins prevent aged proteins from passing from the mother
cell to the daughter cells [137] or organize cell division [138], using energy to
reset their state to ground level [27, 136]. 
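
The sorting performed by the demon can be mimicked in a few lines. The sketch below is a toy illustration under stated assumptions: speeds are drawn from a three-dimensional Gaussian velocity distribution in arbitrary units, the demon uses the median speed as its cutoff, and mean kinetic energy stands in for temperature. All names are hypothetical, and the energy cost of measuring and of resetting the demon's memory, which is the crux of the argument above, is not modeled.

```python
import random

def demon_sort(speeds, threshold):
    """Split molecules into a 'fast' and a 'slow' compartment, as the demon would."""
    fast = [v for v in speeds if v > threshold]
    slow = [v for v in speeds if v <= threshold]
    return fast, slow

def mean_kinetic_energy(speeds, mass=1.0):
    """Mean kinetic energy, used here as a proxy for temperature (arbitrary units)."""
    return sum(0.5 * mass * v * v for v in speeds) / len(speeds)

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
# Homogeneous gas: speed is the magnitude of a velocity with three Gaussian components.
speeds = [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)) ** 0.5 for _ in range(10000)]
threshold = sorted(speeds)[len(speeds) // 2]  # the demon's cutoff: the median speed
fast, slow = demon_sort(speeds, threshold)
print("cold side:", round(mean_kinetic_energy(slow), 3),
      "initial gas:", round(mean_kinetic_energy(speeds), 3),
      "hot side:", round(mean_kinetic_energy(fast), 3))
```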

Information is split into several components: a genetic memory, carried over 
by DNA via faithful replication, epigenetic memory that reproduces a particular 
state of the chassis, including a specific organization of gene expression, and a 
variety of processes managing information transfers. DNA replication uses an 
asymmetrical nanomachine that breaks the DNA double helix opened at a speci- 
fied origin and starts elongating a continuous strand in the 5’ to 3’ direction. The 
process is straightforward in replication of the leading DNA strand. By contrast, 
replication of the lagging strand poses major structural problems. Indeed, repli- 
cation of that strand requires a considerable length of single-stranded DNA that 
must be protected by specific complexes; it also requires management of multi- 
ple initiation complexes, in contrast to replication of the leading strand, which 
may start from a unique replication initiation locus [139]. This dissymmetry
implies that the replication error rates differ on each strand, with different
proofreading systems. Many proofreading processes exist, including the
ATP-powered RecBCD nanomachine, which takes care of double-strand breaks in
E. coli [140]. 

Transcription operates with constraints similar to those of leading strand rep- 
lication. Following transcription, the protein biosynthetic machinery brings 
together complexes composed of ribosomes, chaperones, and localization fac- 
tors into similar actions (begin, elongate, and end). It also interacts directly with 
factors dedicated to disposal of protein fragments (generated during mistransla- 
tion, translation interruption, or premature termination) and more generally to 
protein degradation [141]. The genetic code accommodates 20 amino acids plus 
two variable ones, selenocysteine (coded for by UGA) and pyrrolysine (coded for 
by UAG). Remarkably it seems that in some organisms, the genetic code can be 
modulated via specific growth conditions [142] and that the UGA codon can be 
reassigned to a particular amino acid, differing from tryptophan or selenocyst- 
eine [143]. This implies that the genome could be read at levels of information 
much more elaborate than those understood until now. Nothing is known about 
the corresponding gene organization in the genome, but this opens up consider- 
ably the possibilities of information management in SynBio constructs. 
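
For designers of SynBio constructs, the practical consequence is that a translation table is not a constant. The sketch below is a minimal illustration: it builds the standard genetic code and lets the caller recode UGA as selenocysteine (one-letter code U) or UAG as pyrrolysine (one-letter code O). The function name and the `recoding` parameter are assumptions made for the example, and the sequence context and dedicated machinery that such recoding requires in cells are ignored.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order (UUU, UUC, UUA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {"".join(codon): aa
                 for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(rna, recoding=None):
    """Translate an RNA reading frame; `recoding` overrides selected codons."""
    code = dict(STANDARD_CODE)
    code.update(recoding or {})
    protein = []
    for i in range(0, len(rna) - 2, 3):
        residue = code[rna[i:i + 3]]
        if residue == "*":  # stop codon that has not been recoded
            break
        protein.append(residue)
    return "".join(protein)

message = "AUGUGAGGCUAGUAA"  # toy reading frame: AUG UGA GGC UAG UAA
print(translate(message))
print(translate(message, {"UGA": "U"}))
print(translate(message, {"UGA": "U", "UAG": "O"}))
```

The same message yields three different products depending on how the two codons are read, which is the sense in which the genome can carry more than one level of information.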

Many other functions must be considered in the making of macromolecules 
and eventually implemented in SynBio constructs. Most deal with the fact that 
the threadwire machinery that makes macromolecules cannot fold them readily 
into their final proper three-dimensional shape (discussed in [27] to account for
the difficulty of succeeding in genome transplantation), as well as with
maintenance of the designed shape.

Finally, regulation is another key informational process. It is the main subject 
of most present SynBio experiments, many “BioBricks” being DNA segments 
used to construct regulatory logical gates, with strong emphasis on similarity 
with electronic circuits [144]. Some regulatory functions linked to sensing are
mediated by the widespread sensor-regulator two-component systems [145],
where the channeling of information (separating channels is a challenge) has not
yet been explored. Mechanical sensing is also important during cell growth, as
well as when gases undergo pressure changes [146]. Among the functions of
information transfer, the control of metabolic and development processes is 
essential. Indeed, regulation lies at the core of the SynBio activities centered on 
the genetic program, and the bulk of the work dealing with BioBricks and the like
aims at constructing sophisticated regulatory devices [147, 148]. This will not be
explored further here as regulation is the focus of the vast majority of SynBio- 
devoted work [149]. 


5.5 Conclusion 


SynBio rests on the description of living organisms as separating a genetic 
program from the machine that runs it. In general it is implicitly assumed that 
it is possible to use extant organisms as reference chassis into which one may
transplant artificial genetic constructs, with outputs that work well. Indeed, the
proof of concept of this view has been repeatedly established, in constructing all 
kinds of circuits or metabolic pathways, showing that the key idea of the cell as 
a computer is at least conceptually viable. However, as industrial processes 
require both stability in time and high production, it is important that the proof 
of concept is followed by scaling up in making economically viable constructs. 
We have delineated here some of the constraints that must be taken into account 
to allow a smooth transition from the academic laboratory to the industrial 
scale.


The implementation of synthetic genetic circuits requires precise control of the
expression of every gene involved. This can be achieved by choosing promoters 
that appropriately modulate transcription initiation in terms of intensity and 
duration in response to specific stimuli. In nature, promoters couple gene 
expression to the internal status of the cell and to the external conditions of the 
environment. Here, we describe Saccharomyces cerevisiae promoters. The char- 
acterization of the structural and functional features of natural promoters has 
been crucial for their application. Moreover, this knowledge led to the imple- 
mentation of synthetic promoters displaying novel regulatory properties. 


6.1 Introduction 


The characterization of Saccharomyces cerevisiae promoters began by using 
them to drive expression of reporter genes. Systematic truncations and deletions 
of the promoter region of these constructs revealed that yeast promoters share a 
common modular structure [1, 2]. Each module has a defined role in the stimula- 
tion and regulation of transcription initiation [3, 4]. The characterization of 
natural promoters allowed their use in controlling the expression of heterolo- 
gous genes [5-7]. 

Today, a large selection of well-characterized natural promoters is routinely 
exploited for controlling transcription in yeast [8, 9]. Although these promoters 
span a wide range of transcription initiation efficiencies, they usually do not 
cover them homogeneously; that is, most promoters display either very weak or 
very strong activity. Moreover, natural yeast promoters cannot be used to 
build orthogonal systems, since they are intimately linked to metabolism. These 
two limitations are overcome by constructing synthetic promoters, whose 
strength can be finely tuned and whose regulation can be independent of 
the metabolism.

In this chapter, we deal with natural and synthetic promoters frequently used
in yeast. After giving an overview of the essential features of natural promoters, 
we describe principles and strategies exploited to produce synthetic promoters 
and their cognate transcription factors. We leave out from the discussion other 
aspects of gene expression regulation, like gene copy number, transcription elon- 
gation and termination, transcript processing, mRNA stability, translation, and 
protein stability. 


6.2 Yeast Promoters 


A promoter is a DNA sequence enabling and regulating transcription initiation. 
In this section, we point out the essential structural and functional features 
of yeast promoters. For more detailed descriptions, reviews are available 
[10-14]. 

Yeast promoters consist of two functionally and physically distinguishable 
regions: the core promoter and the upstream element [4, 15, 16] (Figure 6.1, 
top). The core promoter is the region that carries the minimal information 
needed to start transcription, independently of any regulation [3, 15, 17]. It 


Figure 6.1 Modified natural yeast promoters. Top, typical bipartite structure of yeast
promoters. Bottom left, promoter libraries obtained by point mutation. Random point 
mutations, illustrated as stars, are introduced by error-prone PCR along the sequence of the 
starting promoter. Bottom right, promoter libraries obtained by substituting non-consensus 
sequences with random oligonucleotides. By concentrating the mutations in the non- 
consensus regions, it is possible to fine-tune the strength of the starting promoter. N: 
nucleotide; ORF: open reading frame; TATA: TATA element; TFBS: transcription factor binding 
site; TSS: transcription initiation start site.

provides transcription initiation start sites (TSSs) defined by the consensus
sequence A(A-rich)_5 NPyA(A/T)NN(A-rich)_6. In this consensus, A is the first
transcribed nucleotide, N can be any nucleotide, and Py can be only a pyrimi- 
dine [18, 19]. As most yeast core promoters contain several TSSs, several tran- 
script isoforms are usually produced from each promoter [20]. The core 
promoter is a platform for RNA polymerase II recruitment [21]. RNA polymer- 
ase II requires the assistance of general transcription factors to bind the pro- 
moter and become competent for transcription initiation. The general 
transcription factor called TATA-binding protein (TBP) recognizes the TATA 
element, a DNA sequence enriched for T and A [22], placed at variable dis- 
tances upstream of the TSS(s) [18, 19]. The interaction between TBP and the 
TATA element triggers the stepwise recruitment of RNA polymerase II and 
other general transcription factors at the core promoter [22]. This results in the 
formation of the pre-initiation complex (PIC), which is necessary to start
transcription [12, 14]. Since the interaction between TBP and the TATA element
directly triggers PIC assembly, the strength of this binding influences the over- 
all transcription initiation efficiency. Therefore, the TATA element can be con- 
sidered as a module acting as a scaling factor within the core promoter: strong 
TATA elements result in strong promoters; weak TATA elements result in 
weak promoters [23]. After PIC formation, RNA polymerase II searches for the
TSS(s) by scanning the template strand [24]. While RNA synthesis is not
required for the scanning process, the selection of the TSS requires
transcription. Indeed, limiting RNA polymerase II function leads to selection of TSS(s)
further downstream [25]. The region between the TATA element and the 
TSS(s) is usually enriched in Ts, while downstream of the TSS(s), A is the pre- 
ponderant nucleotide [26]. In strong promoters this biased nucleotide distri- 
bution is more evident than in weak promoters, suggesting that this feature 
could have an influence on transcription initiation efficiency, probably by
facilitating the identification of the TSS(s) during scanning [27].
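
The TSS consensus described above lends itself to a simple sequence scan. The sketch below is a rough illustration and not a validated predictor. It assumes that the first transcribed nucleotide is the A that follows the pyrimidine, and it interprets "A-rich" with arbitrary thresholds (at least 3 A in the upstream stretch of 5 and at least 4 A in the downstream stretch of 6). The function name and both thresholds are assumptions made for the example.

```python
PYRIMIDINES = {"C", "T"}

def find_tss_candidates(seq, min_a_upstream=3, min_a_downstream=4):
    """Return 0-based indices of candidate TSS adenines matching the consensus.

    Window layout (18 nt): A, 5 A-rich, N, Py, A (TSS), A/T, N, N, 6 A-rich.
    The A-richness thresholds are illustrative assumptions.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 17):
        w = seq[start:start + 18]
        if (w[0] == "A"
                and w[1:6].count("A") >= min_a_upstream
                and w[7] in PYRIMIDINES
                and w[8] == "A"
                and w[9] in "AT"
                and w[12:18].count("A") >= min_a_downstream):
            hits.append(start + 8)  # position of the assumed first transcribed A
    return hits

# Toy DNA sequence (an assumption): one consensus-like site flanked by G and C runs.
toy = "GGG" + "AAAAAAGCATGCAAAAAA" + "CCC"
print(find_tss_candidates(toy))
```

Because most yeast core promoters contain several TSSs, a scan of this kind would be expected to return several candidates per promoter on real sequences.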

The upstream element confers regulation by recruiting transcription factors 
[13]. Elements stimulating transcription initiation are called upstream activation 
sequences (UASs) and have some common features. First, their regulation 
depends on physiological stimuli [3, 16]. Second, their orientation does not affect 
their performance [28, 29]. Third, the distance between UASs and core promoter 
does not usually influence transcription initiation frequency [4, 28]. Fourth, 
UASs do not regulate transcription when they are placed downstream of the 
TATA element [28]. 

The observation that the UAS orientation and the distance from the TATA 
element do not influence transcription suggests that the UAS activity is inde- 
pendent of the core promoter; that is, these two regions do not interact with the 
same sets of proteins [30]. The distinct roles of core promoter and upstream ele- 
ment were demonstrated by constructing the first synthetic hybrid promoter, 
where the original UAS of a promoter was substituted with one of a second pro- 
moter. The resulting construct initiated transcription from the natural TSS of the 
first promoter but showed the typical regulation of the second [3]. This inde- 
pendence underscores the modular structure of yeast promoters, suggesting the 
possibility to combine several upstream elements. The resulting promoter reacts 
to several physiological stimuli, converging transcription initiation on the TSS(s)
defined by the core promoter [28, 29]. 
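
The modular view of yeast promoters described above maps naturally onto a small data structure. The sketch below is a toy model under loud assumptions: the class and function names are hypothetical, the promoters are placeholders rather than real yeast promoters, and the multiplicative rule (the TATA element scaling the summed contribution of active UASs) is a deliberate simplification of the "scaling factor" idea, not a quantitative model from the literature.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorePromoter:
    name: str
    tata_strength: float  # scaling factor set by the TATA element (arbitrary units)

@dataclass(frozen=True)
class UpstreamElement:
    name: str
    stimulus: str      # physiological signal that activates this UAS
    activation: float  # contribution when the stimulus is present (arbitrary units)

@dataclass(frozen=True)
class Promoter:
    core: CorePromoter
    upstream: tuple    # one or more UpstreamElement objects

    def output(self, stimuli):
        """Toy transcription output: active UAS contributions scaled by the core."""
        active = sum(u.activation for u in self.upstream if u.stimulus in stimuli)
        return self.core.tata_strength * active

def make_hybrid(core_donor, uas_donor):
    """Combine the core promoter of one promoter with the upstream element(s) of another."""
    return Promoter(core=core_donor.core, upstream=uas_donor.upstream)

# Placeholder promoters (hypothetical names and values).
promoter_a = Promoter(CorePromoter("core_A", 2.0), (UpstreamElement("UAS_A", "stimulus_A", 1.0),))
promoter_b = Promoter(CorePromoter("core_B", 0.5), (UpstreamElement("UAS_B", "stimulus_B", 3.0),))
hybrid = make_hybrid(promoter_a, promoter_b)
print(hybrid.output({"stimulus_A"}), hybrid.output({"stimulus_B"}))
```

The hybrid keeps the core of the first promoter, and hence its TSS and scaling, but responds only to the stimulus of the second, mirroring the behavior reported for the first synthetic hybrid promoter.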

Transcription factors bind the upstream element on specific and well-defined 
DNA motifs called transcription factor binding sites (TFBSs). TFBSs are neces- 
sary and sufficient to confer regulation to a promoter [31]. 

In yeast, the most frequently observed mechanism of transcription initiation 
stimulation is activation by recruitment [32]. Transcription activators bind 
TFBSs located in the UAS. Their role merely consists of indicating the DNA 
region that needs to be transcribed. The binding of the transcription activator to 
its TFBS triggers the recruitment of the coactivators SAGA and TFIID, which in 
turn localize TBP to the core promoter. This array of protein-protein interac- 
tions results in the PIC assembly. 

An important hint about the mechanism of activation by recruitment comes 
from the observation that DNA-binding and transcription activation activities of 
yeast transcription activators are functionally and physically separable. In fact, 
yeast activators display a modular structure containing, among others, a DNA- 
binding domain and an activation domain [33]. Truncations retaining either the 
DNA-binding or activation portion fail to initiate transcription. However, reas-
sociation of these two portions restores function [34-36].

The modular structure of yeast transcription activators implies possible regu- 
lation of the mechanism of activation by recruitment. Masking the DNA-binding 
or activation activity by protein-protein interactions results in the failure of 
transcription initiation. The activity of the general repressor complex Cyc8-Tup1
consists of binding and masking the activation domains of target transcription
activators. This interaction inhibits transcription initiation, even
though the transcription activator is bound to its TFBS. Unmasking the activa-
tion domain by abolishing the interactions with Cyc8-Tup1 results in the
recruitment of the transcriptional machinery [37]. 

In eukaryotes, DNA is not directly accessible, since it is wrapped around his- 
tones to form nucleosomes (reviewed in [38]). Nucleosomes provide a general 
inhibitory function that reduces basal transcription initiation of all genes 
(reviewed in [39]). As histones have a general affinity for DNA, nucleosomes 
form at random positions along DNA [40]. DNA-binding proteins that recog- 
nize specific binding sites compete with histones to interact with DNA (reviewed 
in [41]). However, the specific interaction of a DNA-binding protein with its bind-
ing site produces a physical barrier on the DNA that forces nucleosomes to
phase around this point [40, 42]. In some promoters, nucleosome phasing may 
have an indirect role in transcription initiation stimulation by enhancing the 
accessibility of the TFBS of the transcription activator [42]. After the binding of 
the transcription activator, the nucleosomes must be displaced to assemble the 
PIC and start transcription. Therefore, the transcriptional machinery recruits 
factors involved in nucleosome remodeling [11]. The efficiency of nucleosome 
clearance is influenced by the propensity of DNA to be wrapped into nucle- 
osomes [43]. The homopolymeric dA:dT sequences frequently observed in the 
UASs interact weakly with the histones and therefore cause the inefficient 
formation of nucleosomes in the region. This results in easier nucleosome 
clearance and stronger transcription [44]. Therefore, composition, length, and 
position of these sequences along promoters have a direct effect on nucleosome
occupancy and, consequently, on transcription initiation efficiency [45].
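
The dependence on homopolymeric dA:dT tracts described above can be explored in silico. The following is a minimal sketch that only locates runs of A or T in a sequence; the example sequence and the minimum tract length of 5 are assumptions chosen for illustration and are not taken from the cited studies. It does not predict nucleosome occupancy.

```python
import re


def find_homopolymer_tracts(seq, min_len=5):
    """Return (start, end, base) for each run of A or T of at least min_len.

    A run of A on one strand is a run of T on the other, so both are
    reported as poly(dA:dT) tracts. Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end(), m.group()[0]) for m in pattern.finditer(seq)]


# Hypothetical upstream region, invented for illustration only.
promoter = "GCGC" + "A" * 7 + "GCTTGAC" + "T" * 6 + "CGATAAAGC"
tracts = find_homopolymer_tracts(promoter, min_len=5)
for start, end, base in tracts:
    print(f"poly(d{base}) tract at {start}-{end}, length {end - start}")
```

Reporting position and length together reflects the point made in the text that composition, length, and position all matter; how these values map to transcription strength would have to come from experimental data.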

Besides sequences stimulating transcription initiation, some yeast promot- 
ers also carry upstream elements that inhibit the process. These are called 
upstream repression sequences (URSs) and contain TFBSs that bind transcrip- 
tion repressors [12]. Some mechanisms of repression are also based on recruit- 
ment. Here, the binding of the repressor attracts corepressors to the promoter, 
which block transcription initiation by recruiting chromatin remodelers to 
make DNA less accessible for PIC assembly or by preventing the transcrip- 
tional machinery from starting [46]. 


6.3 Natural Yeast Promoters 


We can distinguish two classes of natural promoters: regulated and constitutive. 
A wide selection of these promoters is used today to control gene expression. 
Although natural promoters are popular, their use is frequently limited to special 
genetic backgrounds and/or growth conditions. Nevertheless, the lessons learned 
from nature are essential to create synthetic systems more suitable for biotech- 
nology or synthetic biology applications. 


6.3.1 Regulated Promoters 


The activity of a regulated promoter is, in terms of both timing and intensity, 
specifically dependent on a well-characterized stimulus, for example, a chemical
or physical agent. In many cases the stimulus acts through a single specific
TFBS. The promoters depending on galactose, inorganic phosphate, or copper
described below are interesting examples. 

The most used regulated promoters belong to the GAL genes, involved in 
galactose catabolism. The mechanism of their regulation is well characterized 
and involves several players (reviewed in [47]). GAL4 is the main regulator of the 
GAL circuit and encodes a transcription activator. In the absence of galactose, 
the inhibitor Gal80 binds Gal4, preventing its activity. In the presence of galac- 
tose, Gal4 is released, as Gal80 is sequestered in the cytoplasm. This triggers the 
transcription of the Gal4 targets, which include GAL1, GAL7, and GAL10,
encoding the enzymes of the Leloir pathway, and regulators of the circuit, such 
as GAL80, GAL2, and GAL3. The autocatalytic nature of the GAL circuit gives a 
switch-like response to galactose. The interruption of the positive feedback loop 
controlling the expression of the galactose permease GAL2 results in a linear 
response of GAL genes to increasing amounts of galactose. This allows the induc- 
tion of GAL promoters at intermediate levels [48]. Some yeast strains carry an 
extensive deletion in the TRP1 locus resulting in the truncation of the adjacent 
GAL3 promoter [49]. In this background, induction of the GAL genes is neither
fast nor efficient, because the levels of Gal3 are low [50].

Galactose induces transcription of GAL1 and GAL10 by more than 1000-fold 
[51, 52]. GAL1 and GAL10 are in close proximity on the genome and diverge in
their orientation [52, 53]. Deletion analysis of the DNA sequence lying between 
the two open reading frames revealed the presence of a galactose-dependent
UAS [54, 55]. The UAS contains two shared TFBSs bound by the transcriptional 
activator Gal4. These TFBSs are sufficient to confer galactose-dependent regula- 
tion to a promoter [31]. The first yeast synthetic hybrid promoter, discussed 
above, was assembled by placing the GAL1-10 UAS upstream of the core pro- 
moter region of another gene. The construct exhibited the typical GAL1-10 
galactose-dependent regulation [3]. 

The promoter of the acid phosphatase gene (PHO5) contains two homologous
TFBSs bound by the transcription activators Pho4 and Pho2 when inorganic
phosphate is depleted in the culture medium. The presence of inorganic phos-
phate in the medium leads to Pho4 sequestration in a protein complex that
does not allow its binding to the PHO5 promoter [56, 57]. Although this
promoter displays some basal activity in the repressed state [56], it has been 
successfully used to express heterologous genes, like the hepatitis B surface 
antigen [58]. 

The promoters of genes involved in copper metabolism are frequently used for 
driving transcription of heterologous genes [59, 60]. The CUP1 promoter is stim-
ulated by copper [61]. Its activation depends on the transcription activator Ace1,
whose ability to bind its TFBSs is controlled by its interaction with copper ions 
[62]. Although this is a popular promoter, its use is limited to strains carrying the 
wild-type CUP1 locus. With this genetic background it is possible to avoid toxic 
effects related to excess copper, since the CUP1 gene encodes a metallothionein 
acting as a copper chelator. However, copper is essential in biological processes 
like respiration; therefore it is usually present in traces in culture media. This 
small amount of copper causes a substantial basal expression of genes under the 
control of the CUP1 promoter. The metallothionein encoded by the wild-type 
CUP1 locus contributes to lowering the amount of available copper. As a conse- 
quence the basal activity of the heterologous construct is lowered [63]. An excess 
of copper prevents the transcription of genes encoding copper transporters, like 
CTR1 and CTR3 [64]. The promoters of these genes contain specific TFBSs 
bound by the transcription activator Mac1. When Mac1 interacts with copper
ions, its DNA-binding and activation activities are inhibited [65]. A collection of 
expression vectors containing CUP1, CTR1, and CTR3 promoters is available for 
coordinated induction and inhibition experiments [66]. 

Regulation of the promoters described so far depends on a single TFBS. 
However, some promoters display a combination of TFBSs bound by different 
transcription factors. This results in more sophisticated regulation. Several 
examples are described below. 

The promoter of DAN1, which encodes a mannoprotein, contains a set of TFBSs bound by
both activators and repressors. The combination of the activity of these tran- 
scription factors results in complete repression in the presence of oxygen and full 
activation when this gas is absent from the culture medium [67]. For induction, 
this promoter requires stringent anaerobiosis, which can be realized by bubbling 
nitrogen in the cultures. However, this experimental setup is not convenient for 
large-scale overexpression experiments. To overcome this limitation, the
DAN1 promoter was randomly mutagenized, yielding variants less sensitive to oxygen that can be induced in
microaerobiosis [68]. 


Carbon catabolite repression is the set of regulatory mechanisms forcing cells 
to preferentially use glucose and fructose over other carbon sources (reviewed in 
[69, 70]). For example, the GAL circuit is fully repressed by glucose, even when 
galactose is present in the culture medium [51]. Carbon catabolite repression 
affects the levels and activity of enzymes involved in energy metabolism. We can 
distinguish two main ways of transcription initiation repression by glucose. The 
first way is direct, when the presence of glucose triggers the recruitment of tran-
scription repressors to the target promoters. Mig1 represses transcription initia-
tion via Cyc8-Tup1. When glucose is depleted, Mig1 is phosphorylated by Snf1
and is consequently relocalized to the cytoplasm, thus abolishing repression of
the target promoters [69]. A well-characterized target of Mig1 is GAL1, which
contains the cognate TFBS upstream of its TATA element [54, 55]. The second 
way is indirect, when transcription repression is achieved by inactivating tran- 
scription activators. For example, Adr1 is inactive in the presence of glucose; 
therefore it cannot trigger transcription initiation [71]. A well-characterized 
natural target of this transcription activator is alcohol dehydrogenase 2 (ADH2), 
which is repressed when yeast is grown in glucose [72]. When the TFBS recog- 
nized by Adr1 is cloned in front of the fermentative alcohol dehydrogenase 1 
(ADH1) gene, its expression shows catabolite repression [15]. Carbon catabolite 
repression can be exploited to repress genes of interest until glucose is depleted 
in the culture medium. Besides the ADH2 promoter [73], that of JEN1, the main 
lactate and pyruvate transporter, is also used. The JEN1 promoter was initially
selected to construct biosensors for measuring sugar concentrations, since it 
reacts specifically to carbon sources and is insensitive to most types of cell 
stresses [74]. 


6.3.2 Constitutive Promoters 


A constitutive promoter displays a relatively constant activity that is not signifi- 
cantly altered by stimuli. In most cases, the activity of constitutive promoters is 
coupled to the growth rate, which depends on the level of glucose, the preferred 
carbon source of yeast [69, 75, 76]. This constant transcription is ensured by a 
complex combination of TFBSs. 

The most used constitutive promoters belong to genes involved in primary cell 
metabolism such as glycolysis and fermentation. The main reason for this selec- 
tion is historical; mutations affecting these genes were relatively easily isolated 
and characterized. The regulation of the expression of glycolytic and fermen- 
tative enzymes takes place mainly at the transcriptional level and correlates 
with glucose concentration and growth curve stage [76-80]. The promoters of 
these genes share common TFBSs recognized by the regulators Rap1 and Gcr1
[29, 81, 82], which ensure coordinated transcription [83, 84]. The promoters of 
phosphoglycerate kinase 1 (PGK1) and glyceraldehyde-3-phosphate dehydroge- 
nase 3 (TDH3) are among the strongest ones known [77, 80]. The promoter of 
the fermentative ADH1 gene was one of the first used to overexpress a heterologous
protein in yeast [5]. A vector containing this promoter was also used to produce 
the human hepatitis B vaccine [7]. Although considered strong, the ADH1 
promoter is weaker than those of PGK1 or TDH3 [9, 80]. 

The cytochrome c isoform 1 (CYC1) promoter has been extensively charac-
terized [1, 16]. This gene is involved in cell respiration and its transcription is
triggered by heme [16, 30]. When this metabolite is present in the cells, it
binds the transcription activator Hap1, which can then recognize its cognate
TFBS [30, 85]. Today, a truncated version of this promoter is used to drive
mild and relatively constant expression in fermentative growth conditions.
For this reason the CYC1 promoter is usually described as a constitutive
promoter [9]. 

Besides genes involved in energy metabolism, those involved in other basic 
tasks, like cell shape maintenance and translation, also have constitutive promot-
ers. For example, the promoter of the gene encoding β-actin (ACT1) displays a
combination of regulatory elements ensuring constant transcription in both
fermentative and non-fermentative growth conditions [86]. Similarly, the gene
encoding the translation elongation factor EF-1α (TEF1) has a promoter ensuring approxi-
mately stable expression during all growth phases and in media containing differ-
ent carbon sources [76, 80, 87].


6.4 Synthetic Yeast Promoters 


A synthetic promoter carries nonnative sequences. We describe two main groups 
of synthetic promoters. One includes modified versions of natural promoters, 
and the other contains hybrid promoters. As illustrated below, the main differ- 
ence between promoters belonging to each class is the strategy used to construct 
them. 


6.4.1 Modified Natural Promoters


The systematic modification of a promoter leads to a library spanning a wide 
range of transcription initiation frequencies. Within this library, each member 
drives transcription initiation with a specific strength. Since the members of the 
library are derived from a single promoter, they share similar regulatory features; 
that is, they respond to the same stimulus [88-90]. There are two main methods 
for obtaining promoter libraries: either by introducing point mutations or by 
substituting short sequences with randomized oligonucleotides (reviewed in 
[91]) (Figure 6.1). 

Mutations in essential regulatory sequences are likely to cause a substantial 
change in activity, because they can alter the binding affinity of the cognate 
proteins. A library of TEF1 promoter variants was obtained by error-prone
PCR. With this approach, point mutations were distributed along the complete
sequence of the promoter [92]. The library covered a range of activities from
8% to 120% relative to the native TEF1 promoter [93]. A variation of this strat- 
egy consists in limiting the point mutations to specific regions of the promoter, 
for example, to the TATA element. These modifications alter the efficiency of 
the PIC assembly [94]; therefore they affect the overall performance of the 
promoter [23]. 
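
The idea of a point-mutation library can be illustrated with a small simulation. The sketch below substitutes bases at random with a uniform per-base probability; this is a simplifying assumption, since a real error-prone PCR has a polymerase-specific error spectrum. The parent sequence, mutation rate, and function names are hypothetical and chosen only for illustration.

```python
import random

BASES = "ACGT"


def mutate(seq, rate, rng):
    """Replace each base by a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_library(seq, size, rate, seed=0):
    """Return `size` independently mutated variants of `seq`."""
    rng = random.Random(seed)  # seeded so that the library is reproducible
    return [mutate(seq, rate, rng) for _ in range(size)]


# Hypothetical parent promoter fragment, invented for illustration only.
parent = "ATGCCTATAAAAGGCTTACG"
library = make_library(parent, size=5, rate=0.05, seed=42)
for variant in library:
    n_mut = sum(a != b for a, b in zip(parent, variant))
    print(variant, n_mut)
```

Restricting mutagenesis to a region such as the TATA element, as mentioned above, would correspond to applying `mutate` only to the relevant slice of the sequence.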


An alternative strategy for obtaining promoter libraries is the substitution of 
the non-consensus sequences of the promoter with random sequences. Non- 
consensus sequences usually do not play a direct role in transcription initiation 
regulation; that is, they do not bind to specific proteins. However, those sequences 
might modulate the process indirectly, for example, by keeping the optimal dis- 
tance between functional sequences [91], by influencing the local DNA helical 
parameters [95], or by modulating the efficiency of nucleosome formation and 
clearance [43, 96]. Therefore, promoter variants containing modifications in 
non-consensus sequences will differ from each other by small changes in 
strength. The modification strategy is based on the synthesis of libraries of 
oligonucleotides encoding the promoter sequence. In each oligonucleotide the 
consensus sequences are separated by degenerate stretches of nucleotides of 
variable length [88]. This approach was used to modify the profilin (PFY1) 
promoter. The library obtained spanned a range of activities from 11% to 100% 
relative to the starting promoter [89]. 
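
The oligonucleotide design described above, with fixed consensus sequences separated by degenerate stretches, can be sketched as follows. The two consensus elements and the spacer length are hypothetical placeholders and are not taken from the PFY1 study; the code assumes that the fixed elements themselves contain no N.

```python
import random


def degenerate_template(elements, spacer_lengths):
    """Join fixed elements with runs of N (any base) of the given lengths."""
    if len(spacer_lengths) != len(elements) - 1:
        raise ValueError("need one spacer length per gap between elements")
    parts = [elements[0]]
    for length, element in zip(spacer_lengths, elements[1:]):
        parts.append("N" * length)
        parts.append(element)
    return "".join(parts)


def sample_variant(template, rng):
    """Replace every N by a random base; fixed positions stay unchanged."""
    return "".join(rng.choice("ACGT") if c == "N" else c for c in template)


# Hypothetical consensus elements, invented for illustration only.
elements = ["TGACTC", "TATAAA"]
template = degenerate_template(elements, [8])
rng = random.Random(1)
variants = [sample_variant(template, rng) for _ in range(3)]
print(template)
for variant in variants:
    print(variant)
```

Varying the entries of `spacer_lengths` gives the stretches of variable length mentioned in the text, while the consensus elements are preserved in every library member.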


6.4.2 Synthetic Hybrid Promoters 


Synthetic hybrid promoters combine DNA sequences originally belonging to 
different promoters, but retain the typical bipartite structure of natural promot- 
ers [91] (Figure 6.2). 

The choice of the core promoter has effects on the overall performance of the 
hybrid promoter, as it controls the efficiency of the PIC assembly [23] and the 
identification of the TSS(s) [26]. Frequently, synthetic hybrid promoters contain 
the native core promoter of inducible genes, for example, LEU2 or CYC1 
[3, 90, 97]. Strong synthetic core promoters have been isolated from DNA librar- 
ies where the TATA element and the TSS consensus sequence were separated by 
a randomized spacer of 30 nucleotides. An additional stretch of 30 nucleotides 
placed between the TATA element and the upstream TFBSs improves the core 
promoter robustness by possibly avoiding steric hindrances between the tran- 
scription factors, TBP, and other general transcription factors that bind to the 
TFBSs or the core promoter [98]. 

The main advantage of the synthetic hybrid promoter approach is the possi- 
bility of using any DNA sequence targeted by a protein as an upstream element. 
Endogenous TFBSs link the synthetic promoter to a regulatory pathway. For 
example, by placing the UAS of GAL1-10 in front of the TDH3 promoter, a 
hybrid promoter that is active in glucose and further stimulated by galactose 
was obtained [90]. Heterologous TFBSs either belong originally to other 
species or are artificial sequences. They enable the implementation of orthog- 
onal transcription systems that are independent of metabolism [99]. In fact, 
heterologous sequences are not recognized by any yeast transcription factor, 
unless they are homologous or, by chance, similar to endogenous sequences 
[34, 97, 100]. Promoters containing combinations of TFBSs are sensitive to 
several stimuli [90]. 

Figure 6.2 Synthetic hybrid promoters are obtained by combining a core promoter and one
or more transcription factor binding sites (TFBSs). Each TFBS is specifically recognized by a
transcription factor. By selecting the TFBSs, it is possible to choose which regulation the
synthetic hybrid promoter should display. The combination of two or more different TFBSs
results in a combinatorial regulation. The multiplication of the TFBS copy number results in
the adjustment of the promoter strength.

Modification of the binding affinity between the transcription factor and its
cognate TFBS results in promoter strength modulation [99, 101]. Alternatively,
the strength of the promoter can be tuned by increasing the number of TFBS
copies; in most cases, a linear relationship is observed [90, 99, 102-104]. The
activity of promoters also depends on the spacer sequences placed between regu- 
latory elements [59, 88, 105, 106]. These sequences can have an impact on the 
efficiency of nucleosome clearance [43, 96]. Homopolymeric dA:dT or dG:dC 
stretches disfavor nucleosome formation, thereby increasing transcription initia- 
tion efficiency [44]. On the contrary, DNA sequences containing dA:dT dinucleo- 
tides alternating with dG:dC dinucleotides are wrapped very efficiently into 
nucleosomes, thereby inhibiting transcription initiation efficiency [43]. Therefore, 
it is possible to fine-tune initiation efficiency by modulating the length, composi- 
tion, and location of dA:dT or dG:dC stretches within the promoter [44, 45]. 
Finally, it is also necessary to avoid the formation of structures that may have an 
unpredictable influence on the promoter performance. For example, placing a 
transcriptional terminator-like sequence between the upstream element and the 
core promoter can depress transcription initiation [107]. 
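
The bipartite design of a hybrid promoter lends itself to a simple data structure. The sketch below assembles an upstream element from repeated TFBS copies separated by a spacer and joins it to a core promoter. All sequences are hypothetical placeholders, and `linear_strength` merely encodes the linear trend mentioned above with coefficients that would have to be measured; it is not a fitted or published model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridPromoter:
    tfbs: str     # sequence of one transcription factor binding site
    copies: int   # number of TFBS copies in the upstream element
    spacer: str   # sequence placed after every TFBS copy
    core: str     # core promoter (TATA element, TSS region)

    def __post_init__(self):
        if self.copies < 1:
            raise ValueError("at least one TFBS copy is required")

    def sequence(self):
        """Return the upstream element followed by the core promoter."""
        sites = [self.tfbs] * self.copies
        return self.spacer.join(sites) + self.spacer + self.core


def linear_strength(copies, basal, gain_per_site):
    """Assumed linear trend: strength = basal + gain_per_site * copies."""
    return basal + gain_per_site * copies


# Hypothetical parts, invented for illustration only.
promoter = HybridPromoter(tfbs="CGGAGT", copies=3, spacer="AT", core="TATAAAGG")
print(promoter.sequence())
print(linear_strength(promoter.copies, basal=1.0, gain_per_site=2.0))
```

Keeping the spacer as an explicit field makes it easy to test how its length and composition, for example dA:dT stretches, change the assembled sequence.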


Hybrid promoters containing heterologous TFBSs need a heterologous tran- 
scription factor. To ensure orthogonality, this transcription factor should not 
have any other target than the heterologous promoter itself. The modular struc- 
ture of natural transcription factors suggests that it is possible to combine differ- 
ent protein domains to obtain new factors. 

A protein able to bind DNA can stimulate transcription when fused to an acti- 
vation domain. The first heterologous transcription factor tested in yeast con- 
tained the bacterial DNA-binding protein LexA fused to an activation domain 
and triggered transcription of promoters containing LexA TFBSs [34, 108, 109]. 
Transcription activators containing the bacterial DNA-binding protein tetR are
regulated by tetracycline [97, 110], which prevents the binding of tetR to the
cognate TFBSs [111-113]; therefore, the expression of the target gene can be 
modulated by adjusting the concentration of this chemical in the culture medium. 
A reverse tetR mutant, which binds its TFBS upon addition of tetracycline, is also 
available. However, transcription activators containing reverse tetR have a rela- 
tively strong basal activity in the absence of tetracycline [114]. 

Hybrid promoters containing artificial TFBSs require the construction of 
artificial DNA-binding domains. Zinc fingers and transcriptional activator-like 
effectors (TALEs) are short peptidic modules binding to specific and short DNA 
sequences. Protein engineering has diversified these modules, and libraries of 
protein moieties recognizing virtually all DNA sequences of three to four nucleo- 
tides have been constructed. By fusing several zinc fingers or TALE modules, 
it is possible to obtain arrays that specifically bind longer DNA sequences 
[89, 115, 116]. Artificial transcription activators are obtained by fusing these 
DNA-binding domains to activation domains [100, 106]. 

In general, any mechanism able to target an activation domain to the DNA can 
be used to stimulate transcription initiation. In the clustered regularly inter- 
spaced palindromic repeats (CRISPR)-derived system, the target DNA sequences 
are identified via RNA-mediated interactions, instead of binding of a protein. In 
this system, the DNA-binding activity consists of a single-guide RNA (sgRNA), 
which targets specifically the DNA region to be regulated, and the catalytically 
inactive version of the protein Cas9 (dCas9), which binds specifically the sgRNA. 
By fusing dCas9 to an activation domain, a transcription activator is obtained 
[117, 118]. 
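
Choosing where an sgRNA can direct dCas9 within a promoter can be sketched computationally. The example below assumes the widely used Streptococcus pyogenes Cas9, which requires an NGG protospacer-adjacent motif (PAM) immediately 3' of a 20-nucleotide target; the text does not specify the Cas9 variant, so this is an assumption. The sequence is invented, only the given strand is scanned, and no off-target analysis is attempted.

```python
import re


def find_target_sites(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for every site followed by NGG.

    Only the strand given in `seq` is scanned; coordinates are 0-based.
    """
    seq = seq.upper()
    # The lookahead lets overlapping candidate sites be reported as well.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Hypothetical promoter fragment, invented for illustration only.
fragment = "A" * 20 + "TGG" + "C" * 5
sites = find_target_sites(fragment)
for start, protospacer, pam in sites:
    print(start, protospacer, pam)
```

To cover both strands, the same function can be applied to the reverse complement of the promoter sequence.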

The activation domain stimulates transcription initiation by establishing 
protein-protein interactions with coactivators and components of the transcrip- 
tional machinery [119, 120]. While DNA-binding domains have well-defined 
conserved architectures (reviewed in [12]), activation domains do not share 
common structures, except for a marked acidity [35, 121]. They usually consist of 
multiple unstructured acidic patches; each acidic patch triggers transcription 
initiation when fused to a DNA-binding domain [122]. Any peptide stretch 
displaying such properties can be used to activate transcription [123]. 

The DNA-binding and activation activities do not need to reside on a unique 
protein but can be physically separated on two different molecules. An interac- 
tion between these two is sufficient for transcription initiation. This is exploited 
in the yeast two-hybrid assay [124, 125]. This principle was also used to construct 
a light switchable system. Here, the DNA-binding domain and the activation 
domain of Gal4 are separated and fused to the chromoprotein phytochrome
PhyB and its interactor Pif3, respectively. Red light converts the PhyB fusion 
into its active form, which interacts with the Pif3 fusion, activating transcription. 
Far-red light converts the PhyB fusion into its inactive form, which cannot bind 
Pif3, disabling transcription initiation [126]. 

The activity of heterologous transcription factors can be precisely controlled 
by fusing additional domains that, for example, trigger nuclear localization upon 
a specific stimulus. The human estrogen receptor, when fused to a transcription 
activator, confers a hormone-dependent regulation. This chimera triggers tran-
scription initiation only when β-estradiol is added to the culture medium [102].
Binding of the hormone to the estrogen receptor causes the nuclear localization
of the transcription activator, which, in the absence of inducer, diffuses
throughout the cell [127]. An activator containing the Gal4 DNA-binding domain and
the estrogen receptor binds GAL promoters, but its activity does not depend on 
the carbon source [127, 128]. Estrogen-regulated activators based on heterolo- 
gous DNA-binding domains such as LexA or synthetic zinc fingers result in 
orthogonal systems that specifically regulate the expression of the target pro- 
moters [100, 102]. The LexA-based activator induces the expression of the target 
gene in different growth conditions. Its overall activity can be finely tuned with 
the concentration of β-estradiol in the culture medium, the number of LexA
TFBSs in the target promoter, and the choice of the activation domain [102]. 

An essential aspect of regulated synthetic promoters is the tightness of their 
regulation. A promoter is tightly regulated when it does not have any basal activ- 
ity in the absence of the stimulus. The basal activity of some promoters depends 
on the residual activity of the transcription activator in the absence of the stimu- 
lus [114]. Alternatively, the basal expression can be the consequence of ectopic 
transcriptional events starting upstream of the promoter itself [20]. In this case, 
the insulation of the synthetic transcription unit is necessary. This can be 
obtained by placing a transcriptional terminator in front of the synthetic 
promoter [129]. 

Regulation of gene expression can also be achieved by repression of transcrip- 
tion initiation. A protein binding the DNA between the upstream element and 
the core promoter or within the core promoter prevents the establishment of the 
interactions needed for the effective recruitment of the PIC, causing transcrip- 
tion repression by steric hindrance [33, 36, 89, 107, 130]. The DNA-binding pro- 
tein tetR was used to systematically study this kind of repression. A collection of 
GAL1 promoter variants containing different numbers of tetR TFBSs placed
between the TATA element and the TSS was tested. It was observed that increas- 
ing the number of such TFBSs reduced the basal expression of the system. 
Moreover, repression was stronger when the TFBSs were placed in close proxim- 
ity to the TATA element [103, 105]. At intermediate levels of induction, the 
expression levels of the genes targeted by tetR showed a broad cell-to-cell varia- 
bility. Reduction of such a cell-to-cell variability was obtained by placing tetR 
expression under its own control, implementing a negative feedback loop [131]. 
TetR expression under negative feedback control also resulted in a “linearized” 
dose-response curve, allowing for larger concentration ranges of tetracycline 
and therefore better titratability. This negative feedback-based concept has also 
been applied to mammalian synthetic gene circuits [132]. Repression by steric
hindrance can also be obtained by using a CRISPR-derived system. A complex
consisting of sgRNA and dCas9 competes with the transcription activator of the
target promoter, as it targets the same TFBS. The sgRNA-dCas9 complex pre-
vents transcription initiation by blocking access to the TFBS [117]. However,
neither tetR, nor other bacterial repressor domains, nor the CRISPR-based
systems show a sufficiently tight repression of the basal level to be useful in a
broad setting.

The basal level of heterologous repression systems can be further reduced by 
fusing a DNA-binding domain to a component of the eukaryotic transcriptional 
repressor complex, like Tup1 or Cyc8. LexA, when fused either to Tup1 or 
to Cyc8, mediates repression of hybrid promoters containing LexA TFBSs 
[133, 134]. The CRISPR-derived system has been used in a similar strategy. 
dCas9 was fused to a mammalian repressor that recruits a yeast histone deacety- 
lase. This repressor was targeted to the TEF1 promoter by designing a specific 
sgRNA [117]. As a drawback, these systems slow down the transcriptional induc- 
tion kinetics and affect the expression levels of (endogenous) genes located in 
close proximity. 


6.5 Conclusions 


In this chapter, we highlighted some examples of both regulated and constitutive 
natural yeast promoters. The characterization of these sequences allowed for the 
identification of structural and functional features that are exploited to build 
synthetic promoters and heterologous transcription factors. Examples of the 
application of these promoters in synthetic biology have been reviewed in [75, 91, 
135-137]. 

Today, in the context of implementation of novel functions in cells, the 
construction of robust promoters is crucial [99]. Recent efforts to transform 
biotechnology and synthetic biology into more engineering-like disciplines 
also motivate the construction of synthetic promoters. In fact, their implemen- 
tation is an essential step for the abstraction and standardization of concepts 
like promoter structure and transcription initiation [138]. In this perspective, 
modularization and orthogonality are the aspects that need to be further 
developed. 

Modularization enables the implementation of new promoters and transcrip-
tion factors by combining well-characterized and structurally independent mod-
ules. Orthogonal systems neither depend on nor influence the endogenous
metabolism. This ensures a robust behavior and the possibility to reuse the sys- 
tem in different environments and contexts. Today, zinc finger, TALE, and 
CRISPR-based technologies allow the design of artificial transcription factors 
that recognize unique sequences [99, 100, 139]. Additionally, the CRISPR-based 
toolkits available now allow for simple construction of strains containing con- 
structs with the different mechanisms discussed in this review [140]. Together, 
modularization and orthogonality ensure versatility and the possibility to easily 
construct new systems with improved or new functionalities. 

Definitions


Transcription initiation: Initial steps of transcription until the formation of the 
first RNA bond. 

Promoter: A DNA sequence enabling and regulating transcription initiation. 

Transcription initiation start site (TSS): The first nucleotide of a DNA 
sequence to be transcribed into RNA. 

Core promoter: Promoter region defining the TSS(s) and the assembly of 
the PIC. 

Pre-initiation complex (PIC): Protein complex containing RNA polymerase II 
and the general transcription factors that assemble on the core promoter. 

Transcription factor: A protein regulating transcription initiation. A transcrip-
tion factor is not a subunit of RNA polymerase II.

Upstream element: Promoter region conferring regulation to transcription 
initiation. It contains transcription factor binding sites (TFBSs). 

Orthogonal system: A system that is independent of cell physiology. 

Most human genes are interrupted by one or more introns that have to be
removed to generate mRNAs with intact open reading frames (ORFs), a process 
called pre-mRNA splicing. A ribonucleoprotein complex, the spliceosome, is 
responsible for the accurate removal of the intervening sequences. In alternative
splicing, not all exons are included in the mature mRNA every time, which
creates the possibility that one gene can encode more than one protein. This 
immensely increases the coding capacity of a genome. In humans, aberrant splic- 
ing has been recognized to be the causative agent of several hereditary diseases 
and to drive cancer progression. In contrast to humans, introns are rare in bud- 
ding yeast but seem to be important for fine-tuning gene expression and growth 
under stress conditions. 


7.1 The Discovery of “Split Genes” 


In 1977, Richard J. Roberts and Phillip A. Sharp studied adenovirus type 2, a 
double-stranded DNA virus that causes the common cold. Their aim was to map the 
location of the genes on the viral genome. Unexpectedly, they found that the 
mRNA did not hybridize to the DNA in a continuous stretch. Instead, it hybrid- 
ized to four neighboring segments in the genome, separated by three intervening 
sequences. These intervening sequences were looped out in the DNA as they 
were missing in the mRNA sequence [1, 2]. This came as a surprise, as earlier 
analyses of bacterial genes had suggested that a gene comprises a continuous stretch 
of DNA. Soon after this initial discovery, this discontinuous gene structure was 
shown to be a common feature of eukaryotic genes. Sixteen years later, both were 
awarded the Nobel Prize in Physiology or Medicine for their discovery of “split 
genes.” 

The realization that eukaryotic genes are composed of exons (sequences of a 
gene included in the mature mRNA) and introns (intervening sequences 
removed upon splicing) called for a new mRNA maturation process: the removal 
of the intronic sequences from the pre-mRNA to yield a shortened mature 
mRNA (splicing). Subsequently, a complex machinery that removes the intervening
sequences from the pre-mRNA was identified: the spliceosome. The discovery 
that not all exons are included in the mature mRNA every time came as a further 
surprise. This process was appropriately called alternative splicing and opened 
up the possibility that one gene could code for more than one protein. 

Alternative splicing is highly regulated during development and different 
mRNA isoforms are important for determining the fate of different cell types and 
tissues. Therefore, (alternative) splicing is viewed as an integral part of mRNA 
maturation in eukaryotes, and aberrant splicing has been recognized not only as 
the cause of several hereditary diseases but also as a driver of cancer 
progression. 


7.2 Nuclear Pre-mRNA Splicing in Mammals 


7.2.1 Introns and Exons: A Definition 


The average human gene contains eight exons with a mean length of 145 nucleo- 
tides and introns more than ten times this size [3]. Cis-acting elements encoded 
in the pre-mRNA provide the information that defines an intron (see Figure 7.1). 
The 5’ splice site marks the beginning of the intron and includes the dinucleotide 
GU encompassed within a larger, less conserved consensus sequence. The 3’ end 
of the intron carries three conserved sequence elements. The branch point is 
usually an adenosine located within a less conserved sequence element (branch 
site), typically located 18-40 nucleotides upstream from the 3’ splice site. It is 
followed by the polypyrimidine tract and a terminal AG dinucleotide at the 
extreme 3’ end of the intron [4, 5]. The vast majority of introns contain the 
canonical splice sites GU-AG (99%). However, other categories exist that occur 
rarely, including the noncanonical splice sites GC-AG and AU-AC [6]. 
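
As a minimal sketch of the sequence rules above, the following Python snippet classifies an intron by its terminal dinucleotides and lists adenosines that fall within the typical branch point window. The function names and the distance convention (counted from the adenosine to the last nucleotide of the intron) are illustrative assumptions, and the less conserved consensus sequences surrounding each site are deliberately ignored.

```python
def classify_intron(intron_rna: str) -> str:
    """Classify an intron by its first and last two nucleotides."""
    seq = intron_rna.upper().replace("T", "U")  # accept DNA or RNA letters
    if len(seq) < 4:
        raise ValueError("intron too short to carry both splice-site dinucleotides")
    ends = (seq[:2], seq[-2:])
    if ends == ("GU", "AG"):
        return "canonical GU-AG"
    if ends in {("GC", "AG"), ("AU", "AC")}:
        return f"noncanonical {ends[0]}-{ends[1]}"
    return "unrecognized"


def branch_point_candidates(intron_rna: str, min_dist: int = 18, max_dist: int = 40) -> list[int]:
    """Return 0-based positions of adenosines located min_dist..max_dist
    nucleotides upstream of the last intron nucleotide (assumed convention)."""
    seq = intron_rna.upper().replace("T", "U")
    last = len(seq) - 1
    return [i for i, base in enumerate(seq)
            if base == "A" and min_dist <= last - i <= max_dist]


if __name__ == "__main__":
    intron = "GU" + "C" * 10 + "A" + "U" * 20 + "AG"
    print(classify_intron(intron))          # canonical GU-AG
    print(branch_point_candidates(intron))  # [12]
```

Such a filter only narrows down candidates; it does not predict which adenosine is actually used as the branch point.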


7.2.2 The Catalytic Mechanism of Splicing 


The splicing process consists of two consecutive transesterification reactions. In 
the first step, the 5’ exon-intron junction is attacked by the free 2’ hydroxyl 
group of the branch point adenosine, which lies within the intron itself. 
This leads to cleavage at the 5’ splice site and ligation of the 5’ end of the intron 
to the 2’ hydroxyl group of the branch point adenosine. In the second step, the 
free 3’ hydroxyl group of the released 5’ exon in turn attacks the phosphate at the 
3’ intron-exon border. This results in ligation of the two exons and the release of 
the intron in the form of a lariat (reviewed in [5, 7, 8]). 
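
The two steps can be followed on a plain string. The sketch below is a bookkeeping model only: the names, the index conventions, and the representation of the lariat as a "loop" plus a "tail" are assumptions made for illustration, and no chemistry is simulated.

```python
from typing import NamedTuple


class SplicingProducts(NamedTuple):
    mrna: str         # ligated exons
    lariat_loop: str  # intron 5' end through the branch adenosine (closed by the 2'-5' bond)
    lariat_tail: str  # intron segment downstream of the branch adenosine


def splice(pre_mrna: str, five_ss: int, branch_point: int, three_ss: int) -> SplicingProducts:
    """five_ss: index of the first intron nucleotide;
    branch_point: index of the branch adenosine;
    three_ss: index of the first nucleotide of the downstream exon."""
    if not (0 < five_ss <= branch_point < three_ss < len(pre_mrna)):
        raise ValueError("expected 0 < five_ss <= branch_point < three_ss < len(pre_mrna)")
    if pre_mrna[branch_point].upper() != "A":
        raise ValueError("branch point must be an adenosine")

    # Step 1: the branch adenosine attacks the 5' splice site.
    # The upstream exon is released; the intron 5' end joins the branch adenosine.
    exon1 = pre_mrna[:five_ss]
    loop = pre_mrna[five_ss:branch_point + 1]

    # Step 2: the 3' hydroxyl of the released exon attacks the 3' splice site.
    # The exons are ligated and the intron leaves as a lariat.
    tail = pre_mrna[branch_point + 1:three_ss]
    exon2 = pre_mrna[three_ss:]
    return SplicingProducts(exon1 + exon2, loop, tail)


if __name__ == "__main__":
    exon1, intron, exon2 = "AAGG", "GUAAGUCCUAACUUUUUCAG", "CCUU"
    products = splice(exon1 + intron + exon2,
                      five_ss=len(exon1),
                      branch_point=len(exon1) + 10,
                      three_ss=len(exon1) + len(intron))
    print(products)
```

Concatenating the loop and the tail returns the complete intron, which is a quick consistency check on the chosen indices.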


7.2.3 A Complex Machinery to Remove Nuclear Introns: 
The Spliceosome 


Splicing is catalyzed by the spliceosome, a large and highly dynamic macro- 
molecular ribonucleoprotein complex that assembles on the intron-containing 
pre-mRNA. The major spliceosome consists of the U1, U2, U4/U6, and U5 small 
nuclear ribonucleoprotein particles (snRNPs). Each snRNP consists of an snRNA 
(two in the case of U4/U6) and seven Sm proteins that form a ring-shaped 
structure (U6, as an exception, contains Sm-like proteins). Each snRNP 
additionally contains a variable number of particle-specific proteins. 
Furthermore, a large number of auxiliary proteins assemble co-transcriptionally 
on nascent pre-mRNAs to accurately recognize the splice sites [5, 9-11]. 

Figure 7.1 Conserved sequence elements of mammalian and budding yeast pre-mRNAs. Exons (cylinders) are separated 
by introns (lines). The consensus sequences in mammals and budding yeast at the 5’ splice site, branch site, and 3’ splice 
site are as indicated. N is any nucleotide, R is purine, and Y is pyrimidine. Mammals contain a polypyrimidine-rich stretch; 
S. cerevisiae contains a polyuridine-rich stretch. Both are located between the branch site and the 3’ splice site. In 
mammals, cross-exon complexes are formed during early stages of spliceosome assembly, while in S. cerevisiae the introns 
are defined. The spliceosomal snRNPs U1 and U2 (green) are shown interacting with the splice sites. Mammals additionally 
have the U2 auxiliary factor (U2AF), U2AF65 and U2AF35 (green), interacting with the 3’ splice site. They also use auxiliary 
regulatory elements that either enhance the splicing process, namely, exonic and intronic splicing enhancers (ESE and ISE, 
dark green cylinders), or inhibit spliceosome assembly, such as exonic and intronic splicing silencers (ESS and ISS, red 
cylinders). These elements are often bound by SR proteins and hnRNPs. 

The cis-acting pre-mRNA sequence elements help to define the splice sites and 
mediate interactions between the pre-mRNA and components of the spliceo- 
some [12-14]. The 5’ splice site interacts with the U1 snRNP via base pairing 
between the splice site and the 5’ end of the U1 snRNA. The 3’ end is consecu- 
tively recognized by several proteins, including non-snRNP factors like splicing 
factor 1 (SF1), which binds to the branch point. The U2 auxiliary factor (U2AF), 
a heterodimer consisting of a 65 kDa and a 35 kDa subunit, binds the polypyrimidine 
tract and the 3’ splice site. These factors form the early (E) complex. In a subse- 
quent step, the E complex is joined by the U2 snRNP that binds to the branch 
point forming the A complex. This structure is then bound by the preassembled 
tri-snRNP consisting of the U5 and the U4/U6 snRNPs, generating the precata- 
lytic B complex. The B complex undergoes major rearrangements in RNA-RNA 
and RNA-protein interactions, leading to the destabilization of U1 and U4 
snRNP binding. This catalytically activates the B complex to mediate the first 
catalytic step of splicing and yields the C complex, which in turn catalyzes the 
second step. The spliceosome then dissociates and is recycled for additional 
rounds of splicing [5, 7, 11]. Recently, several high-resolution structures of different 
spliceosomal complexes from budding yeast and humans have been solved using 
cryo-electron microscopy (e.g., [15-17]). These structures give unprecedented 
insight into the architecture of the different complexes and aid our understanding 
of the structural rearrangements that have to occur to complete one catalytic 
cycle. 


7.2.4 Exon Definition 


When the length of an intron exceeds 200-250 nucleotides, which is the case for 
most introns in higher eukaryotes, early splicing complexes form across an exon 
[18], a process called exon definition [19]. During exon definition, the U1 snRNP 
binds to the 5’ splice site downstream of an exon and promotes the association of 
U2AF with the polypyrimidine tract at the upstream 3’ splice site. This subsequently 
leads to the recruitment of the U2 snRNP to the branch point upstream 
of the exon. The complex is stabilized by the binding of additional proteins of the 
serine/arginine (SR) protein family (see Section 7.5.2) to enhancer elements 
within the exon [20, 21]. In addition, exon definition might be facilitated by pairs 
of intronic enhancer elements flanking constitutive as well as alternatively spliced 
exons [22]. 

Before proceeding to the splicing reaction, exon-defined complexes must be 
converted to intron-defined complexes. This requires disruption of the cross- 
exon interactions, followed by conversion into a cross-intron A complex, in 
which a molecular bridge is formed from U2 to U1 bound to an upstream 5’ 
splice site [21, 23]. In an alternative assembly pathway, the tri-snRNP is already 
present in the exon definition complex, interacting with the U2 snRNP by base 
pairing between the U2 and U6 snRNAs [24]. Such complexes can then be 
directly converted to precatalytic B-like complexes, without prior formation of 
the cross-intron A complex [24]. 

Splicing mainly occurs co-transcriptionally (see Section 7.5.4) with a 5’ to 3’ 
directionality. Exceptions to this rule include introns flanking alternatively 
spliced exons, whose excision can be delayed or can even occur 
posttranscriptionally [25, 26]. 


7.3 Splicing in Yeast 


7.3.1 Organization and Distribution of Yeast Introns 


From an evolutionary perspective, yeasts are a highly diverse group of 
single-celled microorganisms within the kingdom of fungi. The budding yeasts (also 
called “true yeasts”), including the well-known Saccharomyces cerevisiae, belong to 
the phylum Ascomycota. The fission yeast Schizosaccharomyces pombe is also an 
ascomycete but is only distantly related to S. cerevisiae, whereas other yeasts 
belong to the phylum Basidiomycota. Both S. cerevisiae and S. pombe are 
eukaryotic model organisms. 

The genome of S. pombe was sequenced in 2002 [27]. It contains ~4800 genes, 
~43% of which contain introns (up to 15 per gene). The average intron size is 81 
nucleotides. With regard to splicing factors and 3’ splice site selection, splicing in 
S. pombe is considered mechanistically more similar to splicing in mammals than to 
splicing in S. cerevisiae [28, 29]. Nevertheless, we focus exclusively on budding 
yeast in the following sections, as S. cerevisiae is the more widely studied 
eukaryotic model organism. 

In contrast to mammals (and fission yeast), very few genes in budding yeast 
contain introns: only 5% of the ~5800 genes [30]. In addition, most 
intron-containing genes carry only one intron; a mere 10 genes contain two 
introns. Why does S. cerevisiae have such an intron-poor genome? In general, 
unicellular eukaryotes seem to be under pressure to lose introns. A correlation 
exists between the intron density of a genome and the logarithm of the genera- 
tion time of an organism: organisms with a short generation time tend to have 
fewer introns when compared with more slowly growing organisms [31]. This 
observation could be explained by selection for smaller genomes and for faster 
protein production, for example, in response to stress conditions. 

Intron boundaries in S. cerevisiae are well defined, with a 6 bp sequence at 
the 5’ splice site and a 7 bp sequence at the branch site required for efficient 
splicing (see Figure 7.1) [32-34]. The average distance between the branch 
point and the 3’ splice site is 30 nucleotides, and this region also contains a poly(U) 
tract (see Figure 7.1) [30, 35]. Introns tend to be short, with an average length 
of 154 nucleotides in non-ribosomal protein genes and 408 nucleotides in 
ribosomal protein genes. 

Introns are not equally distributed in the S. cerevisiae genome, but are highly 
enriched in ribosomal protein genes. Eighty-nine of the 137 ribosomal protein genes 
(>60%) contain at least one intron, whereas only 198 of the remaining genes (<4%) 
contain introns [30]. Furthermore, the location of introns within a gene is 
strongly biased toward its 5’ end [31]. This bias is thought to arise from homologous 
recombination of a gene’s cDNA with its genomic copy, which could simultaneously 
explain how introns might have been lost during yeast evolution. cDNA, which 
lacks introns, arises by reverse transcription of mRNA as a by-product of 
retrotransposon activity. Reverse transcription starts at 
the 3’ end of the mRNA and often terminates prematurely, which leads to a 3’ bias in 
cDNAs and, as a result, to preferential loss of introns at the 3’ end of genes after 
recombination of the cDNA with the genomic copy. 

Furthermore, S. cerevisiae introns tend to be located in highly expressed 
mRNAs. 27% of all mRNAs produced per hour are generated from the 5% 
intron-containing genes [36]. Genome-wide analyses of mRNA [37] and protein 
[38] levels showed that, on average, intron-containing genes produce ~3.9-fold 
more RNA and 3.3-fold more protein than intronless genes. 
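
The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the function and key names are arbitrary, and the per-gene ratio assumes that the remaining 73% of mRNA production is spread over the remaining 95% of genes.

```python
def yeast_intron_stats() -> dict[str, float]:
    total_genes = 5800          # approximate gene count quoted in the text
    intron_fraction = 0.05      # share of genes containing introns
    rp_genes = 137              # ribosomal protein genes
    rp_with_introns = 89
    other_with_introns = 198
    mrna_share = 0.27           # share of mRNA production from intron-containing genes

    # Mean production per intron-containing gene relative to an intronless gene.
    per_gene_ratio = (mrna_share / intron_fraction) / ((1 - mrna_share) / (1 - intron_fraction))

    return {
        "intron_genes_from_percentage": total_genes * intron_fraction,
        "intron_genes_from_counts": rp_with_introns + other_with_introns,
        "rp_fraction_with_introns": rp_with_introns / rp_genes,
        "other_fraction_with_introns": other_with_introns / (total_genes - rp_genes),
        "per_gene_production_ratio": per_gene_ratio,
    }


if __name__ == "__main__":
    for name, value in yeast_intron_stats().items():
        print(f"{name}: {value:.3f}")
```

The two estimates of the number of intron-containing genes (about 290 and 287) agree well. The per-gene production ratio comes out at roughly sevenfold; it is derived from production per hour [36] and is therefore not directly comparable with the ~3.9-fold difference in RNA levels reported in [37].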


7.4 Splicing without the Spliceosome 


7.4.1 Group I and Group II Self-Splicing Introns 


Interrupted genes are found not only in the genomes of yeast and metazoans, but 
are present in all classes of organisms. The majority of introns are spliced out by 
the spliceosome (nuclear pre-mRNA introns). In addition, self-splicing introns 
(group I and group II) exist, in which the intervening sequences can excise 
themselves from the RNA in an autocatalytic manner [39]. 

Group I and II introns are found in the DNA of organelles, bacteria, and the 
nucleus of lower eukaryotes (group I only). Their occurrence is more sporadic in 
bacteria than in lower eukaryotes and is most common in the organelles of higher 
plants. Whereas group II introns are mainly found in organelles, group I introns 
interrupt rRNA, mRNA, and tRNA in bacteria, as well as in the organelles of 
lower eukaryotes, and some plants. In addition, they have been found in several 
bacteriophages. 

Nuclear pre-mRNA introns are defined by cis-acting sequence elements that 
are recognized by the spliceosome. Group I and group II introns, in contrast, 
adopt a typical secondary structure that contains distinct domains, which then 
folds into a highly complex tertiary structure. As a consequence, the catalytic 
mechanism of this splicing reaction solely depends on the sequence and the cor- 
rect folding of the intron. The RNA tertiary structure brings the 5’ and the 3’ 
splice sites in close proximity and generates a catalytic site. The fold is stabilized 
by several magnesium ions, allowing the RNA to perform the splicing reaction in 
vitro by itself, without any enzymatic activities provided by proteins. Proteins are 
required only to assist correct folding of the complex structure in vivo. For this 
fundamental discovery that RNA can harbor catalytic function, Thomas R. Cech 
(together with Sidney Altman) was awarded the Nobel Prize in Chemistry 
in 1989 [40]. 

For group I introns, the only factors required for autosplicing are monovalent 
and divalent cations and a guanine nucleotide cofactor. The 3’ hydroxyl group of 
the cofactor attacks the 5’ end of the intron, resulting in the first transesterification 
reaction. The free 3’ hydroxyl of the first exon thus generated then attacks 
the junction between intron and second exon, leading to the second transesteri- 
fication step. Consequently, the intron is released as a linear molecule that 
circularizes later [41]. 

Group II introns share the same catalytic mechanism as nuclear pre-mRNA 
introns excised by the spliceosome with the first nucleophilic attack of a branch 
point adenosine, resulting in lariat formation of the intron (see Section 7.2.2) 
[42]. Interestingly, recent data indicate that the U6 snRNA of the spliceosome 
catalyzes both splicing steps by positioning divalent metal ions so that they sta- 
bilize the leaving group during each reaction. Notably, all ligands of the catalyti- 
cally active metal ions in the U6 snRNA correspond to ligands observed to 
position catalytically active divalent metals in the crystal structures of group II 
intron RNAs [43]. This agreement indicates that group II introns and the 
spliceosome share common catalytic mechanisms and probably common evolutionary 
origins [44]. It also suggests that splicing evolved from an autocatalytic 
reaction inherent to an individual RNA molecule [45]. As splicing became more 
complex, proteins started to play a more important role. Importantly, the simi- 
larities between the catalytic core of the group II intron and the U6 snRNA sup- 
port the hypothesis that spliceosomal introns in eukaryotes developed out of 
group II self-splicing introns [46]. 


7.4.2 tRNA Splicing 


The splicing of tRNAs in archaea and eukarya is the only example of intron 
removal that does not involve transesterification but instead proceeds by 
successive cleavage and ligation reactions. Intron-containing pre-tRNAs carry a 
single intron located one nucleotide from the anticodon. These introns are short 
(14-60 nucleotides) and have no consensus sequence. They are recognized by an 
endonuclease that detects a common secondary structure of the tRNA rather than 
a sequence element. It cleaves both ends of the intron, generating two tRNA 
halves that are subsequently joined by an RNA ligase [47]. 


7.5 Alternative Splicing in Mammals 


7.5.1 Different Mechanisms of Alternative Splicing 


Alternative splicing affects 95% of all human genes [48, 49] and produces multi- 
ple mRNA molecules from a single gene. The resulting proteomic diversity 
is important for many different cellular processes, including cell growth and 
differentiation [13]. 

Alternative splicing events can be divided into four major categories: inclusion 
and exclusion of (cassette) exons, the usage of alternative 5’ or 3’ splice sites, and 
the retention of entire introns (see Figure 7.2). Of these, the cassette exon type 
accounts for approximately one third of all alternative splicing events in humans 
[50]. Cassette exons are either fully included or excluded in the mature mRNA. 
Figure 7.2 Alternative splicing events in mammalian transcripts. The main types of alternative 
splicing, which are responsible for the generation of different transcripts, are depicted. Dark 
gray cylinders indicate constitutive exons, and light gray cylinders indicate alternative exons. 


In certain cases, multiple cassette exons can be mutually exclusive, producing 
mRNAs that always include one of several possible exon choices, but not more. 
Additionally, the use of alternative 5’ or 3’ splice sites can lengthen or shorten 
exons, a mechanism that accounts for 25% of all alternative splicing events [50]. 
Finally, the failure to remove an intron leads to a splicing pattern called intron 
retention. All four types can occur in the translated or untranslated regions 
(UTRs) of any given pre-mRNA [51]. 

Many genes show multiple splicing patterns, often in conjunction with the 
usage of alternative promoters or polyadenylation sites. One striking example is 
the fast skeletal troponin T (tnnt3) gene, whose product is part of the troponin 
complex and which undergoes extensive alternative splicing. The tnnt3 gene 
comprises 19 exons, including five alternatively spliced exons (exons 4-8) and a 
pair of mutually exclusive exons (exons 16 and 17) [52]. While isoforms including 
exon 17 (or β) are predominantly expressed throughout development, exon 16 
(or α)-containing isoforms are mostly abundant in adult muscles [53, 54]. In 
addition, the tnnt3 gene contains a developmentally regulated fetal exon F 
located between exons 8 and 9 [55, 56]. 
Hence, multiple isoforms with structural differences are produced and 
regulated during muscle development and adaptation. Changes in splicing enable 
transitions from large to small and from acidic to basic isoforms during muscle 
development [57]. Furthermore, the resulting protein isoforms show differences 
in their sensitivity to Ca²⁺ activation and their cooperativity of contraction 
[58-62]. Recently, differential expression patterns of tnnt3 pre-mRNAs were 
observed in rat skeletal muscle in response to variation in body weight and also 
in C2C12 muscle cells upon mechanical stretching [63, 64]. 
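
To give a feeling for how quickly such an exon structure multiplies the number of possible transcripts, the following sketch enumerates all combinations for the tnnt3 layout described above. It is a combinatorial upper bound under stated assumptions, not a list of observed isoforms: every alternative exon is treated as an independent choice, and exons 1-3, 9-15, 18, and 19 are assumed to be constitutive, which the text does not state explicitly.

```python
from itertools import product

CASSETTE_EXONS = (4, 5, 6, 7, 8)  # each independently included or skipped (assumption)
MUTUALLY_EXCLUSIVE = (16, 17)     # exactly one of the two is included
FETAL_EXON = "F"                  # located between exons 8 and 9


def enumerate_isoforms() -> list[tuple]:
    """Return every exon combination allowed by the assumed rules."""
    isoforms = []
    for cassette_choice in product((False, True), repeat=len(CASSETTE_EXONS)):
        for fetal_included in (False, True):
            for exclusive_exon in MUTUALLY_EXCLUSIVE:
                exons = [1, 2, 3]
                exons += [e for e, keep in zip(CASSETTE_EXONS, cassette_choice) if keep]
                if fetal_included:
                    exons.append(FETAL_EXON)
                exons += list(range(9, 16))  # exons 9-15, assumed constitutive
                exons.append(exclusive_exon)
                exons += [18, 19]
                isoforms.append(tuple(exons))
    return isoforms


if __name__ == "__main__":
    print(len(enumerate_isoforms()))  # 2**5 * 2 * 2 = 128
```

Which of these combinations actually occur in vivo depends on the regulation discussed in the following sections.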

Apart from physiological adaptations, aberrant splicing of the tnnt3 pre-mRNA may 
contribute to disease development. An aberrant splicing pattern was identified 
in myotonic diseases type 1 and 2 [65]. In mice overexpressing FRG1 (FSHD 
region gene 1), aberrant splicing of the tnnt3 pre-mRNA leads to an anomalous 
fast skeletal troponin T isoform that characterizes dystrophic symptoms [66]. 


7.5.2 Auxiliary Regulatory Elements 


To allow for a correct decision as to which exon is removed or included, addi- 
tional RNA sequence elements and regulatory proteins are required. A genome- 
wide study of alternative splicing in mammalian tissues revealed an important 
role of RNA-binding proteins in splicing regulation via their interaction with 
cis-acting regulatory elements [67]. The relevant RNA sequence elements are 
categorized depending on their function and position. Sequences enhancing the 
splicing reaction are known as exonic splicing enhancer (ESE) or intronic splicing 
enhancer (ISE), while sequences that inhibit splicing are called exonic splicing 
silencer (ESS) or intronic splicing silencer (ISS) [9]. 

In general, splicing regulators appear to exhibit position-dependent effects on 
splicing outcomes [68-70]. One family of RNA-binding proteins is the SR protein 
family. To interact with the RNA, SR proteins contain one or two N-terminal RNA 
recognition motifs (RRMs). Additionally, they contain a unique, variable-length 
RS domain at their carboxyl-terminus that functions as a protein interaction 
domain [71-73]. The core SR protein family consists of 12 members, named 
serine/arginine-rich splicing factors SRSF1-SRSF12 [74]. SRSF1 and 
SRSF2 were discovered for their essential roles in constitutive and alternative 
splicing [75]. They promote both U1 snRNP binding to the 5’ splice site and U2 
snRNP binding to the 3’ splice site, allowing for communication between these 
recognition events [76-79] and thereby facilitating exon definition (see Section 
7.2.4). The role of SR proteins in splice site selection is discussed in Section 7.5.3. In addition, 
individual SR protein expression is subject to extensive auto- and cross-regulation 
[80, 81]. They also interact with chromatin [82], couple with the transcription 
machinery [83, 84], and are involved in mRNA export [85]. The regulation of SR 
protein activity occurs at the posttranslational level. Site- or region-specific phos- 
phorylation, catalyzed by specific SR protein kinases, is essential to modulate 
their functions during different stages of RNA processing (reviewed in [86]). 

Another family of splicing regulators is the extended family of heterogeneous 
nuclear ribonucleoproteins (hnRNPs). This family includes an initially identified 
set of more than 20 polypeptides, designated hnRNP A to U [87]. Their number 
has further increased as many splicing isoforms, paralogs, and newly identified 
proteins have been added based on structural and functional considerations 
(reviewed in [88]). Affiliated hnRNP-like proteins include Nova, Sam68, QKI, TDP-43, 
TIA, Hu, Fox, CUG-BP, MBNL, and ESRP proteins [88]. hnRNPs share a 
modular structure, most frequently containing one or more RRMs, one of the 
most abundant protein domains found in eukaryotes [89, 90]. The K homology 
(KH)-type RNA-binding domain occurs in the hnRNP proteins K and E and in 
the hnRNP-like proteins Nova, Sam68, and QKI [88]. In addition, many 
hnRNPs contain RGG boxes (repeats of Arg-Gly-Gly) and other auxiliary 
domains, such as acidic and glycine- or proline-rich domains [91, 92]. Most 
hnRNPs shuttle between nucleus and cytoplasm [93]. 

A few examples illustrate the multifunctionality of hnRNPs. 
hnRNP A1 is one of the most abundant and ubiquitously expressed members 
(reviewed in [94]). Its role is not limited to splicing regulation, but includes func- 
tions in transcription [95-97], mRNA stability [98, 99], mRNA export [100], 
translation [101, 102], and telomere maintenance [103, 104]. Another example is 
polypyrimidine-tract-binding protein (PTB) or hnRNP I (reviewed in [105]), 
which is involved in splicing [106], mRNA stability [107], and polyadenylation 
[108, 109]. It also stimulates translation initiation at picornavirus internal ribo- 
some entry site (IRES) elements [110, 111]. HnRNP L contains four RRMs that 
specifically recognize CA-repeat and CA-rich RNA elements [112]. It partici- 
pates in intronless mRNA export [113, 114], translational regulation [98], mRNA 
stability [112, 115], poly(A) site selection, and alternative splicing [112]. HnRNP 
L competes with microRNAs for binding to a CA-rich RNA element within the 
vegfa (vascular endothelial growth factor A) 3’ UTR [116]. Recently, activities of 
hnRNP L were analyzed on a genome-wide level, and an in vivo enrichment of 
CA motifs as hnRNP L binding sites was confirmed. A position-dependent splic- 
ing regulation was demonstrated: while binding to intronic regions upstream of 
alternative exons leads to repressed splicing, binding to the downstream intron 
activates splicing [117]. 

Concerning their role as splicing regulators, many examples of hnRNPs and 
hnRNP-like proteins show negative regulation, including Nova1 [118], hnRNP 
A1 [119], Fox2 [120], HuR [121], hnRNP H [122], hnRNP F [123], and PTB [124, 
125]. A positive regulation has been shown for hnRNP A1 [126], hnRNP H [127], 
hnRNP G [128], and PTB [129]. Similar to SR proteins, hnRNPs show a position- 
dependent effect on splicing regulation [130]. 


7.5.3 Mechanisms of Splicing Regulation 


It is frequently difficult to make a clear distinction between “constitutive” and 
“alternative” splicing. The decision depends on cis-acting elements like strong or 
weak splice sites (a higher degree of similarity to the consensus sequence 
increases splice site strength) and additional enhancer or silencer elements in the 
vicinity of the splice sites. Furthermore, the abundance and concentration of 
each splicing factor in a given cell type affect the splicing decision [131-133]. 
SR proteins are the main enhancers known to facilitate splice site recognition 
and exon inclusion by binding to ESEs. In general, they help components of the 
spliceosome to bind the pre-mRNA. This includes the recruitment of the U1 
snRNP to the 5’ splice site. Additionally, they recruit the U2AF heterodimer and 
U2 snRNP to the 3’ splice site and help establish exon definition complexes (see 
Section 7.2.4) [76, 134, 135]. Their activity is mediated by their RS domain and 
the phosphorylation status (reviewed in [86]). The enhancing effect SR proteins 
exert on exon inclusion is position dependent. Binding of SR proteins to intronic 
regions can induce exon skipping [136]. Furthermore, binding of SR proteins to 
exonic regions can have a differential impact on the inclusion of cassette exons. 
Binding of SR proteins to ESEs within the cassette exon enhances its inclusion, 
whereas binding to ESEs within the flanking constitutive exons promotes skip- 
ping of the cassette exon [137, 138]. 
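
The idea that splice site strength grows with similarity to the consensus can be made concrete with a simple position-by-position comparison. The sketch below is a deliberately crude stand-in: it counts matches against a consensus written with the nucleotide codes used in Figure 7.1 (N, R, Y), whereas practical predictors weight positions differently. The consensus GURAGU used in the example is the commonly cited intronic part of the mammalian 5’ splice site and is an assumption here, since the text does not spell it out.

```python
# Nucleotide codes: R is purine, Y is pyrimidine, N is any nucleotide.
CODES = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "AG", "Y": "CU", "N": "ACGU"}


def consensus_similarity(site: str, consensus: str) -> float:
    """Return the fraction of positions at which `site` is compatible with `consensus`."""
    site = site.upper().replace("T", "U")
    consensus = consensus.upper().replace("T", "U")
    if not consensus or len(site) != len(consensus):
        raise ValueError("site and consensus must be non-empty and of equal length")
    matches = sum(base in CODES[symbol] for base, symbol in zip(site, consensus))
    return matches / len(consensus)


if __name__ == "__main__":
    for candidate in ("GUAAGU", "GUGAGU", "GUCAGU", "GCAAGU"):
        print(candidate, round(consensus_similarity(candidate, "GURAGU"), 2))
```

A site scoring below 1.0 in this scheme would count as weaker and, in the sense of the paragraph above, as more dependent on nearby enhancer elements.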

SR proteins can cooperate to promote exon inclusion. Different SR proteins 
can recognize the same ESE and compensate for each other or act cooperatively 
by binding to adjacent ESEs [69]. Additionally, SR proteins may form larger com- 
plexes with other RS domain-containing proteins, such as the SR-related nuclear 
matrix proteins SRm160 and SRm300, which are unable to bind RNA by them- 
selves. These coactivators can form multiple interactions with snRNPs and 
enhancer-bound SR proteins; thus they enhance activity through bridging inter- 
actions between ESEs and spliceosomal components [139]. 

The splicing process can be inhibited by various mechanisms. Often, hnRNPs 
like PTB or hnRNP A1 are involved. The simplest mode of inhibition is steric 
blocking of positive regulators. This happens when silencer elements are located 
close to splice sites or to splicing enhancer elements, so that splicing is 
inhibited by blocking the access of snRNPs or positive regulatory factors. PTB, for 
example, binds the polypyrimidine tract and therefore blocks binding of U2AF to 
alternatively spliced exons [125]. Several other mechanisms by which PTB inhib- 
its splicing have been elucidated [140]. It can inhibit U2AF binding also when 
bound to exonic sequences [141]. PTB binding to ISSs can inhibit the transition 
from an exon definition to an intron definition complex [142] or prevent interac- 
tion of the U1 snRNP with other spliceosomal components [143]. Furthermore, 
PTB might induce exon skipping by looping out exons flanked by intronic PTB 
binding sites [144]. 

Like SR proteins, hnRNPs can cooperate to inhibit exon inclusion [68]. 
Recently, it was shown that inclusion of the cd45 exon 4 is repressed by hnRNP 
L binding to an ESS. HnRNP L recruits hnRNP A1 and together the two hnRNPs 
induce extended contacts of the U1 snRNP with exonic sequences, preventing 
U6 snRNP contacts with the 5’ splice site and subsequent spliceosomal 
catalysis [145]. 

Splicing of individual pre-mRNAs usually involves the integration of additive 
and competitive signals from both splicing activating and repressing elements. 
Along this line, SR proteins can induce exon inclusion by competing with repress- 
ing hnRNPs. One example is the role of hnRNP A1 in the repression of exon 3 of 
the HIV1 tat pre-mRNA. An ESS in exon 3 binds the repressor hnRNP A1 with 
high affinity and inhibits splicing by propagating the binding of further hnRNP 
A1 proteins toward the 3’ splice site. This propagative binding can be inhibited 
by the binding of the SR protein SRSF1 to an upstream ESE, which then activates 
splicing. Additionally, hnRNP A1 binds an ISS located upstream of exon 3, 
thereby preventing binding of the U2 snRNP [7, 119, 146]. 
It is important to note that the same splicing factors can stimulate as well as 
inhibit the inclusion of cassette exons depending on their respective binding site 
position. This has been shown for SR proteins (see previous text) and hnRNPs. 
For example, hnRNP H has been shown to promote exon inclusion when bound 
to intronic positions, but induce exon skipping when bound to exonic sequences 
[147, 148]. The hnRNP-like splicing factor Nova1 is exclusively expressed in CNS 
neurons and recognizes YCAY clusters. A genome-wide map revealed that the 
position of its binding site relative to the regulated exon dictates whether Nova1 
promotes exon inclusion or skipping [118]. 
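
The position dependence described in this section can be summarized as a small lookup table. The sketch below encodes only the tendencies stated in the text for SR proteins, hnRNP L, and hnRNP H; the position labels and function name are invented for illustration, Nova1 is omitted because the direction of its effect at each position is not given here, and real outcomes depend on the combined input of many factors.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    INCLUSION = "inclusion"
    SKIPPING = "skipping"


# (factor, binding position relative to the cassette exon) -> reported tendency
POSITION_RULES = {
    ("SR protein", "cassette exon"): Outcome.INCLUSION,
    ("SR protein", "flanking exon"): Outcome.SKIPPING,
    ("SR protein", "intron"): Outcome.SKIPPING,
    ("hnRNP L", "upstream intron"): Outcome.SKIPPING,
    ("hnRNP L", "downstream intron"): Outcome.INCLUSION,
    ("hnRNP H", "intron"): Outcome.INCLUSION,
    ("hnRNP H", "exon"): Outcome.SKIPPING,
}


def predicted_tendency(factor: str, position: str) -> Optional[Outcome]:
    """Return the tendency listed above, or None if the text gives no rule."""
    return POSITION_RULES.get((factor, position))


if __name__ == "__main__":
    print(predicted_tendency("hnRNP L", "downstream intron"))  # Outcome.INCLUSION
    print(predicted_tendency("Nova1", "upstream intron"))      # None
```

Returning None for unlisted combinations keeps the table from implying knowledge that the text does not provide.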

The accumulated knowledge on the impact of cis-regulatory motifs, exon fea- 
tures (e.g., length, splice site strength), and RNA structure was successfully com- 
bined to build a “splicing code” that accurately predicts tissue-specific expression 
of alternatively spliced cassette exons [149]. 


7.5.4 Transcription-Coupled Alternative Splicing 


Splicing is not only controlled by a plethora of different splicing factors but is 
also coupled to transcription, as already shown in early studies [150]. Global 
sequencing analyses of multiple tissues and cell types in different organisms indi- 
cate that co-transcriptional splicing is widespread ([25, 151-157], reviewed in 
[158]). In budding yeast, fly, and human cell lines and tissues, the vast majority of 
introns are co-transcriptionally spliced [25, 151, 154, 156, 157]. Due to their 
experimental and analytical differences, it is sometimes hard to compare the 
studies. While some findings show that intron length negatively correlates with 
co-transcriptional splicing frequency in mouse, human, and fly [25, 155, 156], 
another study, focusing on highly expressed genes with long introns, came to the 
exact opposite conclusion [151]. However, numerous studies agree that constitu- 
tive splicing occurs to a greater degree in a co-transcriptional manner than alter- 
native splicing [25, 151, 155, 156]. One study in mouse macrophages found that 
full-length yet incompletely spliced transcripts accumulated in the chromatin 
fraction [152]. The relatively low frequency of co-transcriptional splicing in this 
and in another mouse study [155] is in contrast to the high numbers found in 
yeast, fly, and human cells. To resolve this discrepancy, directly comparable 
human and mouse cell types should be analyzed. 

The alternative splicing decision can be influenced by several elements, 
including promoters [159, 160], transcription factors [161, 162], and coactivators 
[163-165], as well as transcription enhancers [166], chromatin remodelers [167], 
and factors affecting chromatin structure [168-172]. Two models are currently 
discussed that are not mutually exclusive: the recruitment model and the kinetic 
model (reviewed in [173]). 

The recruitment model involves the recruitment of splicing factors to tran- 
scription sites by the transcription machinery. The carboxy-terminal domain 
(CTD) of RNA polymerase II (Pol II) has a key role in functionally coupling tran- 
scription to capping and 3’ processing. Additionally, several alternative splicing 
factors associate with the CTD, implicating this domain in alternative splicing. 
One example is the splicing factor SRSF3, which interacts with the CTD and 
inhibits inclusion of cassette exon 33 in the fibronectin mRNA [174]. 

The kinetic model proposes that the rate of transcription elongation affects 
the outcome of alternative splicing. One possibility is that upon Pol II pausing or 
slowing down, inclusion of alternative exons increases. An upstream exon with 
a weak 3’ splice site can be defined, before the downstream exon is synthesized, 
resulting in exon inclusion at slow transcription rates but exclusion at fast tran- 
scription kinetics. Other mechanisms include a Pol II “roadblock” upon DNA 
binding by proteins, like the CCCTC-binding factor (CTCF), which stalls the 
Pol II complex and therefore promotes inclusion of the alternative exon 5 in 
cd45 [175]. 
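
The "window of opportunity" idea behind the kinetic model can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that commitment to the weak upstream 3’ splice site is a first-order process and that it has to happen before Pol II has transcribed the distance to the competing downstream exon. The function name, the parameters, and all numbers are hypothetical and are not taken from the cited studies.

```python
import math

def inclusion_probability(distance_nt, elongation_rate_nt_per_s, k_commit_per_s):
    """Toy model: probability that the weak upstream exon is committed
    before the downstream exon has been synthesized.

    The time window is the time Pol II needs to cover distance_nt.
    Commitment is treated as a first-order process (an assumption).
    """
    window_s = distance_nt / elongation_rate_nt_per_s
    return 1.0 - math.exp(-k_commit_per_s * window_s)

# Hypothetical values: 3000 nt to the downstream exon, commitment rate 0.01 per second.
for rate in (10, 30, 60):  # nt/s: slow, intermediate, fast elongation
    p = inclusion_probability(3000, rate, 0.01)
    print(f"{rate:>3} nt/s -> inclusion probability {p:.2f}")
```

Under these assumptions, inclusion falls monotonically as elongation speeds up, which is the qualitative behavior the kinetic model predicts. A Pol II pause or roadblock corresponds to a locally lowered elongation rate.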

As histone modifications directly affect the Pol II elongation rate, they can also 
have an impact on alternative splicing. Exons show increased nucleosome occu- 
pancy, probably caused by their higher GC content compared with the flanking 
intronic regions [176-178]. Furthermore, the histones associated with exons are 
enriched in certain modifications, which influence alternative splicing decisions 
(reviewed in [179]). Trimethylation of histone H3 lysine 9 (H3K9me3) is corre- 
lated with transcriptional repression. Enrichment of H3K9me3 marks on alter- 
native exons in the cd44 gene has been shown to increase exon inclusion [171]. 
The H3K9me3 modification is recognized by the chromodomain protein HP1γ, 
which reduces the local elongation rate of Pol II. Conversely, an increase in the 
Pol II transcription rate by increased histone 3 lysine 9 acetylation (H3K9ac) 
leads to skipping of the ncam exon 18 [172]. 

Similar to the recruitment model discussed earlier, proteins recognizing spe- 
cific histone modifications have been shown to modulate alternative splicing by 
recruitment of splicing factors. One example is trimethylation of histone 3 lysine 
36 (H3K36me3). This mark can be recognized by the Mrg15 (MORF-related 
gene 15) protein, which recruits PTB to an ISS near a mutually exclusive exon in 
fgfr2 (fibroblast growth factor receptor 2), repressing its inclusion in mesenchymal cells 
[169]. Furthermore, it has been proposed that the short isoform of Psip1 (PC4 
and SF2 interacting protein 1) enhances exon inclusion by recruitment of the 
splicing factor SRSF1 to H3K36me3 marks [170]. 


7.5.5 Alternative Splicing and Nonsense-Mediated Decay 


Apart from increasing protein diversity, alternative splicing can also result in 
mRNA degradation via the nonsense-mediated mRNA decay (NMD) pathway. 
NMD is one of several RNA surveillance mechanisms to ensure the accuracy 
of gene expression by degrading mRNAs that contain a premature termination 
codon (PTC). At first, it was thought that NMD only removes defective mRNAs 
arising from errors in gene expression to avoid accumulation of truncated, non- 
functional proteins [180]. Nowadays, it is known that alternative splicing can 
introduce PTCs and exploit NMD to achieve quantitative posttranscriptional 
regulation [181]. In mammals, a stop codon is recognized as premature if it is 
located >50-55 nucleotides upstream of an exon-exon junction, which is marked 
by an accumulation of several proteins and called an exon junction complex 
(EJC) [182]. According to this rule, one third of the human alternative mRNA 
isoforms in the RefSeq database were predicted to be subject to NMD [183]. 
However, upon siRNA-mediated depletion of the NMD factor UPF1 in HeLa 
cells, only 10% of PTC-containing genes showed slightly increased mRNA 
levels [184, 185]. Such a rather minor role is in accordance with a microarray 
study, showing uniformly low levels of PTC-containing splice variants across 
diverse mammalian cell types and tissues [186]. 
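
The distance rule stated above lends itself to a short computational check. The following Python sketch flags a stop codon as premature if it lies more than a threshold number of nucleotides upstream of an exon-exon junction in the spliced mRNA. The function names, the coordinate convention (distance measured from the end of the stop codon), and the example exon lengths are assumptions made for illustration. The default threshold of 50 nt is the lower bound of the 50-55 nt range given in the text.

```python
from itertools import accumulate

def exon_junctions(exon_lengths):
    """Positions of exon-exon junctions in spliced-mRNA coordinates,
    given as the number of nucleotides upstream of each junction."""
    return list(accumulate(exon_lengths))[:-1]

def is_premature(stop_end, exon_lengths, threshold_nt=50):
    """Apply the distance rule: a stop codon counts as premature if it lies
    more than threshold_nt upstream of at least one exon-exon junction.

    stop_end is the position just after the stop codon in the spliced mRNA.
    """
    return any(j - stop_end > threshold_nt for j in exon_junctions(exon_lengths))

# Hypothetical transcript with three exons; junctions lie at 200 and 350.
exons = [200, 150, 300]
print(is_premature(250, exons))  # stop ends 100 nt upstream of the last junction
print(is_premature(320, exons))  # stop ends only 30 nt upstream of the last junction
print(is_premature(500, exons))  # stop in the last exon, no downstream junction
```

A purely sequence-based prediction of this kind is what produces estimates such as the one-third figure from RefSeq. As the depletion experiments above show, it overestimates how many isoforms actually respond to NMD.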

Although the usage of alternative splicing coupled to nonsense-mediated 
mRNA decay (AS-NMD) might be less prevalent than indicated by initial com- 
putational surveys, this process is pivotal in regulating the expression of certain 
gene families. Among other RNA-binding proteins, AS-NMD was shown to be 
prevalent for members of the SR protein and hnRNP families, indicating that it is 
an important mechanism for the homeostatic regulation of splicing factors [80, 
187, 188]. 

Further, recent work has shown the function of NMD during cellular differen- 
tiation and in response to stress, regulating the expression of certain splicing 
isoforms (reviewed in [189]). Coupled to the observation that the deletion of 
NMD factors is embryonic lethal in mouse [190-193], these findings emphasize 
the importance of this mRNA surveillance mechanism for the maintenance of 
physiological processes. 


7.5.6 Alternative Splicing and Disease 


Aberrant splicing has been recognized as the cause of several diseases and also 
appears to drive cancer progression [194-196]. Of the known disease-causing 
single nucleotide polymorphisms (SNPs), 15% are located within splice sites and 
>20% in predicted splicing elements [197, 198]. A comprehensive list of diseases 
caused by mutated 5’ and 3’ splice sites including cystic fibrosis, Alzheimer’s 
disease, and several types of cancer is available at the database for aberrant splice 
sites (DBASS) [199]. 

Mutations of cis-acting elements can result in several aberrant splicing events: 
mutations disrupting exon definition, for example, in ESEs, 5’ or 3’ splice sites, 
often lead to exon skipping, resulting in nonfunctional proteins, or in the case of 
frame shifting to the introduction of PTCs. One example for disrupted exon defi- 
nition is spinal muscular atrophy, which is described later. Similar effects are 
seen with mutations that activate cryptic splice sites, resulting in the retention of 
intronic sequences. Such activation of cryptic splice sites was already described 
in 1982 to cause β-thalassemia [200]. Furthermore, mutations in silencer or 
enhancer elements affecting the inclusion ratio of cassette exons do not alter the 
encoded mRNA/protein isoforms, but nevertheless can induce pathological 
effects as isoform ratios are important cell-type-specific determinants. For 
example, several intronic SNPs in the neuregulin receptor erbB4 are associated 
with the increased expression of splicing isoforms upregulated in patients with 
schizophrenia [201]. 

Mutations in trans-acting factors can also have a severe impact on splicing 
regulation. Consistently, their occurrence in core spliceosomal factors is very 
rare, suggesting that mutations with an impact on the basal splicing machinery 
are embryonic lethal. The few known examples include mutations in the splicing 
factor SF3B1 (a component of the U2 snRNP) that are frequently observed in 
leukemia patients [202]. In contrast, mutations in splicing factors important for 
alternative splicing are more frequent. Examples are mutations in TDP-43 and 
FUS, which are connected to amyotrophic lateral sclerosis (ALS) and other neu- 
rodegenerative disorders [203, 204]. 

Consequently, modulation of disease-causing aberrant splicing is used as a 
therapeutic approach [196, 205]. In spinal muscular atrophy, a motor neuron 
disease, the smn2 gene is the only source for the essential survival motor neuron 
(SMN) protein due to an inactivation of smn1. Inefficient inclusion of exon 7 in 
the smn2 mRNA, due to a silent mutation disrupting an ESE, leads only to the 
production of residual amounts of full-length protein [206]. Antisense oligonu- 
cleotides (ASOs) have been developed that force inclusion of exon 7 by masking 
a downstream ISS. ASOs are small oligonucleotides that base pair with exons, 
splice sites, or splicing factor binding sites to subsequently modulate splicing 
decisions [205]. This leads to the increased production of functional SMN pro- 
tein, resulting in enhanced motor neuron function and survival (from 10 to 
>500 days) in a mouse model of severe disease [207]. One of these ASOs is now 
the first antisense drug functioning via splicing correction and the first FDA- 
approved treatment for SMA [208]. In general, ASOs show high efficacy, delivery 
to several tissues, the ability to cross the blood-brain barrier and, until now, no 
severe side effects, making them promising new therapeutics for the treatment 
of splicing-related diseases. 


7.6 Controlled Splicing in S. cerevisiae 


7.6.1 Alternative Splicing 


Alternative splicing events in S. cerevisiae are rare with only three examples 
known so far. The most extensively alternatively spliced gene is the nuclear 
export factor mtr2 [33]. Mtr2 contains an intron in its 5’ UTR, which includes 
two 5’ splice sites and three 3’ splice sites. Five of the six possible combinations 
and the unspliced transcript are detectable. The six different transcripts either 
encode proteins with different N-termini or 5’ UTRs containing differing num- 
bers (up to three) of upstream open reading frames (uORFs). The function of 
these different encoded proteins/5’ UTRs or how splice site selection is regulated 
is unknown. 

A further example of alternative splice site usage in S. cerevisiae is src1. SRC1 
acts in sub-telomeric gene expression and TREX-dependent mRNA export 
[209]. Its intron contains two overlapping 5’ splice sites: GCAAGUGAGU (No. 1 
underlined, No. 2 bold [210]). Usage of the downstream 5’ splice site results in 
the expression of a long protein isoform that codes for two transmembrane 
domains [209]. Usage of the upstream 5’ splice site results in a shorter protein 
with only one transmembrane domain and reduced activity. Again, it is not 
known how (and if) splice site selection is regulated. 

In S. cerevisiae, three SR-like homologs (NPL3, HRB1, and GBP2) and one 
hnRNP-like protein (HRP1) have been identified. Mutagenesis studies indicate 
that only NPL3 may be involved in splicing [211]. However, RNA-binding pro- 
teins important for the splicing of individual transcripts have been reported. 
Mer1 is transcribed only during meiosis and activates splicing by binding to an 
intronic enhancer sequence (AYACCCUY) near the 5’ splice site. Splicing activa- 
tion by MER1 is further dependent on a reduced basal splicing efficiency of the 
introns (e.g., a non-consensus 5’ splice site) and on the NAM8 protein, which is 
part of the U1 snRNP [212]. 

Furthermore, intron retention can be regulated by a negative feedback loop of 
the encoded protein: overexpression of the RNA export factor YRA1 is toxic to 
cells. Therefore, its expression has to be tightly regulated. YRA1 restricts its own 
expression by inhibiting the splicing of a highly unusual intron in its ORF. With 
766 nucleotides, this intron is very large: it is located 300 nucleotides down- 
stream of the AUG and contains a non-consensus branch point (GACUAAC). 
All these unusual features seem to be important for autoregulation, which relies 
on a suboptimal splicing efficiency and co-transcriptional binding of YRA1 
[213, 214]. The unspliced pre-mRNA is exported to the cytoplasm, where its 
degradation is initiated by EDC3-activated decapping and completed by XRN1 
digestion. 

In addition to these cases, in which a specific protein regulates one (or four) 
specific transcripts, the spliceosome itself might differentiate between different 
introns. Genome-wide studies of changes in splicing efficiency after the intro- 
duction of mutations in 18 core spliceosomal components revealed several 
transcript-specific effects [215]. This implies that not only specialized factors 
but also the core spliceosome machinery itself can influence differential splicing 
decisions. 


7.6.2 Regulated Splicing 


Instead of alternative splicing, “regulated splicing” is predominantly found in 
S. cerevisiae. There, nonfunctional introns are retained in the mature mRNA, 
introducing PTCs that ultimately lead to mRNA decay. The degradation of 
unspliced pre-mRNAs can occur in the nucleus involving the exosome. 
Additionally, intron-containing mRNAs can be exported to the cytoplasm, where 
they are degraded by either the 5’ to 3’ exonuclease XRN1 or the NMD pathway. 
Whether an intron-containing mRNA is directed to the NMD pathway 
depends on the intron’s identity [216]. 

The most prominent example of regulated splicing in S. cerevisiae occurs dur- 
ing meiosis. All 13 of the intron-containing genes related to meiosis are spliced 
inefficiently during exponential growth in rich medium, but splicing is dramati- 
cally induced during sporulation [217]. This regulation mechanism seems to 
depend on the competition of meiosis-related genes with intron-containing 
ribosomal proteins for the splicing machinery [218]. During meiosis, the 
expression of ribosomal proteins is temporarily repressed. During this time period, the 
global splicing efficiency, including splicing of meiosis-related genes, is improved. 
Ribosomal proteins comprise ~90% of all splicing substrates during vegetative 
growth, outcompeting other intron-containing pre-mRNAs for the splicing 
apparatus. Therefore transcriptional repression of ribosomal proteins leads to an 
overall change in the composition of nuclear pre-mRNAs, ultimately allowing for 
efficient splicing of otherwise inefficiently spliced meiosis-related pre-mRNAs. 
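
The competition argument can be sketched with the standard rate law for several substrates sharing one limiting enzyme pool. This is an illustrative model, not the analysis performed in [218]: the function name, the abundances, the K values, and the assumed 10-fold repression of ribosomal protein (RP) transcripts are all hypothetical.

```python
def per_transcript_rate(target, substrates, vmax=1.0):
    """Competitive Michaelis-Menten sketch: splicing rate per molecule of `target`
    when all substrates share one pool of splicing machinery.

    substrates maps a name to (abundance, K); a larger K means a weaker substrate.
    """
    load = sum(abundance / k for abundance, k in substrates.values())
    _, k_target = substrates[target]
    return vmax / k_target / (1.0 + load)

# Hypothetical vegetative state: RP pre-mRNAs make up ~90% of all substrates.
vegetative = {"RP": (90.0, 1.0), "meiotic": (1.0, 10.0), "other": (9.0, 1.0)}
# Hypothetical meiotic state: RP expression repressed 10-fold, everything else unchanged.
meiosis = {"RP": (9.0, 1.0), "meiotic": (1.0, 10.0), "other": (9.0, 1.0)}

gain = per_transcript_rate("meiotic", meiosis) / per_transcript_rate("meiotic", vegetative)
print(f"fold increase in meiotic splicing rate: {gain:.1f}")
```

In this sketch the meiotic pre-mRNAs themselves are unchanged. Their splicing improves only because the dominant competitor is removed, which mirrors the mechanism proposed in the text.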


7.6.3 Function of Splicing in S. cerevisiae 


In contrast to the finding that introns seem to be beneficial for high gene expres- 
sion, most of them can be deleted without affecting growth in rich medium [219, 
220]. Also, multiple deletions of introns in one strain, for example, all 16 introns 
within the 15 intron-containing cytoskeleton-related genes, showed no impact 
on growth under standard laboratory conditions. Only the deletion of introns in 
RNA-binding proteins caused growth defects in rich medium. There, introns 
seem to be important for gene expression by the endogenous promoter, as heter- 
ologous expression of the genes from an act1 promoter restored cell growth. 

Introns also seem to be important for fine-tuning gene expression and growth 
under stress conditions. Parenteau et al. systematically deleted the introns of all 
ribosomal proteins and investigated their impact on gene expression, rRNA 
maturation, and growth under stress conditions [219]. They found that 21% of 
the intron deletions inhibited growth in the presence of staurosporine and 37% 
affected growth during at least one of five stress conditions tested. Furthermore, 
intron deletions did not only alter the expression of the respective gene but also 
affected pre-rRNA processing and expression of the paralog in duplicated 
genes [219]. This shows that, although the few remaining introns in 
S. cerevisiae are in most cases not important for cell survival per se, they do play a role in 
posttranscriptional gene regulation and therefore might not be readily expelled 
from the genome in future evolution. 

Recently, a novel role for the spliceosome in the regulation of intronless genes 
has been discovered [221]. Intronless genes containing consensus 5’ splice sites 
and branch point sequences are bound by the spliceosome and spliced (at least 
the first step of splicing is performed). The incorrectly spliced pre-mRNA is sub- 
sequently degraded, leading to the downregulation of gene expression. The 
authors suggest that the expression of ~1% of the intronless genes in S. cerevisiae 
is regulated by this so-called spliceosome-mediated decay (SMD) mechanism. 

The advancement of novel high-throughput sequencing methods allows for an 
unprecedented in-depth analysis of expressed isoform variants. Consequently, 
several recent transcriptomic studies discovered novel introns and usage of alter- 
native splice sites in S. cerevisiae, significantly expanding the role of splicing in 
this “intron-poor” eukaryote [222-224]. 


7.7 Splicing Regulation by Riboswitches 


A decade ago, a novel RNA-based regulatory mechanism was discovered. The 
so-called riboswitches are structured RNA elements usually residing in the 5’ 
UTR of bacterial genes [225]. Riboswitches consist of two domains: an aptamer 
domain and an expression platform. The aptamer domain senses the amount 
of a small molecule ligand. Its binding leads to a structural rearrangement, which 
is translated to the expression platform, subsequently modulating gene expres- 
sion. Most bacterial riboswitches regulate gene expression by either transcrip- 
tional termination or translational repression. They are found predominantly in 
genes related to the metabolic pathways of their cognate ligand. 

In recent years, a plethora of different riboswitch classes sensing a diverse 
set of ligands has been discovered. In most cases the ligands are nucleobases, 
amino acids, or coenzymes, but ion- and second messenger-sensing riboswitches 
have also been identified (reviewed in [225-227]). 


7.7.1 Regulation of Group I Intron Splicing in Bacteria 


Recently two structurally different classes of riboswitches sensing the second 
messenger cyclic diguanylate (c-di-GMP) were discovered [228, 229]. In Clo- 
stridium difficile, a class II c-di-GMP riboswitch was identified upstream of a 
group I self-splicing intron [228]. Here, the start codon of the downstream 
gene is engaged in base pairing interactions with sequences of the group I 
intron structure. Furthermore, the intron contains an atypical 5’ splice site that 
is partially sequestered by base pairing with an anti-5’ splice site sequence. In 
the absence of c-di-GMP, guanosine triphosphate (GTP) cannot attack the 
sequestered 5’ splice site, but attacks a site near the 3’ splice site, resulting in a 
nonfunctional mRNA, with an accessible start codon, but lacking a ribosomal 
binding site (rbs) (see Figure 7.3a, middle). Formation of the riboswitch struc- 
ture in the presence of c-di-GMP leads to disruption of the anti-5’ splice site 
stem. The correct 5’ splice site can now be attacked by GTP and the group I 
intron removed completely. As a result, the start codon is no longer 
sequestered. In addition, a functional rbs is created by the joining of the exon 
sequences (see Figure 7.3a, left side). Both events ultimately lead to gene 
expression. 

Apart from the allosteric activation of self-splicing, the c-di-GMP riboswitch 
in C. difficile can regulate translation of the downstream gene in a second step 
[230]. After removal of the group I intron, the upstream riboswitch lies next to 
the newly created rbs. Access to the rbs is then regulated by binding of c-di-GMP 
to the aptamer domain like in classical translation-controlling riboswitches (see 
Figure 7.3a, right side and bottom). To date, this is the only example of a 
riboswitch regulating a self-splicing intron. 


7.7.2 Regulation of Alternative Splicing by Riboswitches in Eukaryotes 


So far, only one riboswitch class, the thiamine pyrophosphate (TPP) 
riboswitch, has been found in all three domains of life. In all identified cases, the 
eukaryotic riboswitches are located in introns and regulate alternative splicing in 
a TPP-dependent manner. Depending on the organism, the intronic sequences 
containing the riboswitch are located in different parts of the pre-mRNA, and 
alternative splicing subsequently triggers different downstream effects [231, 
232]. In Aspergillus oryzae, a TPP riboswitch resides in an intron in the 5’ UTR 
of a thiamin biosynthetic gene [233]. In Neurospora crassa, three TPP ribos- 
witches have been identified [234]. Two of them reside in an intron in the 5’ UTR 
(see Figure 7.3b). There, the intron encodes two 5’ splice sites. Under conditions 
of low TPP concentration, the upstream 5’ splice site is used, leading to the com- 
plete removal of the intronic sequence and high levels of gene expression. When 
the TPP concentration is high, the downstream 5’ splice site mediates partial 
retention of the intronic sequence. The retained sequence introduces a uORF, 
which negatively affects gene expression. In the third case, the intron containing 
the riboswitch is located in the main ORF of the gene. When the TPP level is low, 
the intron is removed completely, resulting in gene expression [235]. TPP 
binding leads to incomplete intron removal by usage of downstream 5’ splice sites, 
disrupting the main ORF by the introduction of frame shifts. So, in all three 
cases, TPP leads to the downregulation of gene expression by modulation of 5’ 
splice site usage. 

Figure 7.3 Splicing regulation by riboswitches. (a) In the bacterium C. difficile, a c-di-GMP binding riboswitch regulates splicing of 
a group I intron. Depending on the presence of the ligand, different mRNAs are produced by alternative splicing. Left: Upon 
c-di-GMP binding to the riboswitch, an otherwise sequestered 5’ splice site (indicated in red) becomes accessible to the cofactor 
GTP, and the complete group I intron is removed. Therefore, joining of the exon sequences creates an accessible ribosomal binding 
site (rbs) and the downstream gene can be expressed. Middle: In the absence of c-di-GMP, the correct 5’ splice site (indicated in 
red) is inaccessible for GTP attack. GTP attack on a downstream site (indicated in pink) occurs, creating a truncated mRNA without 
a ribosomal binding site. Therefore, gene expression is inhibited. Right: In very rare cases, the group I intron is correctly spliced in 
the absence of c-di-GMP. Nevertheless, gene expression does not occur in the absence of c-di-GMP, as the newly created rbs is 
sequestered within the riboswitch. In cases where the complete group I intron has been removed, subsequent c-di-GMP binding to 
the aptamer domain of the riboswitch leads to structural rearrangements, rendering the rbs accessible. Gene expression can thus 
be switched on or off, depending on the ligand binding state of the riboswitch. (b) In the filamentous fungus N. crassa, two genes 
harbor a TPP riboswitch within an intron in their 5’ UTR. Both introns contain two 5’ splice sites. Top: In the absence of TPP, the 
downstream 5’ splice site (pink) is sequestered by base pairing interactions with the free TPP aptamer domain. Consequently, the 
upstream 5’ splice site (red) is used, leading to complete intron removal and subsequently to gene expression. Bottom: In the 
presence of TPP, the aptamer domain binds its ligand, which renders the downstream 5’ splice site accessible to the spliceosome. 
Thus, a part of the intron is retained after splicing, introducing a uORF, which inhibits gene expression. (c) In higher plants (e.g., 
Arabidopsis thaliana), TPP aptamer domains are found in introns within 3’ UTRs. There, gene expression is regulated by usage of 
two different 3’ processing sites (diamonds). Top: In the absence of TPP, the 5’ splice site is sequestered by the aptamer domain, 
leading to intron retention. 3’ processing occurs at the upstream site (red) encoded within the intron. This leads to a stable mRNA 
with a short 3’ UTR and gene expression. Bottom: In the presence of TPP, the 5’ splice site is accessible and the upstream 3’ 
processing site is removed along with the intron. Usage of the downstream 3’ processing site (pink) leads to an mRNA with a long 
3’ UTR. This mRNA is unstable, as long 3’ UTRs in plants trigger NMD. Therefore, gene expression is repressed. AAA = poly(A) tail, 
c-di-GMP = cyclic diguanylate, GTP = guanosine triphosphate, m7G = 7-methylguanosine cap, ORF = open reading frame, 
rbs = ribosomal binding site, ss = splice site, TPP = thiamine pyrophosphate, uORF = upstream open reading frame. 

In higher plants, TPP riboswitches reside in the 3’ UTRs and control intron 
retention by regulating the accessibility of the 5’ splice site (see Figure 7.3c) 
[236, 237]. At low TPP concentrations, the 5’ splice site is inaccessible and a 
stable mRNA with a short 3’ UTR is expressed. In the presence of high amounts 
of TPP, the 5’ splice site is accessible and the intron in the 3’ UTR is removed. 
Splicing of the intron also removes the major 3’ end processing site. As a result, 
another downstream 3’ processing site is used, leading to an elongated 3’ UTR 
that induces degradation by NMD. 

In all cases studied, sequences within the aptamer domain of the TPP ribos- 
witch base pair with splicing signals (usually the 5’ splice site), rendering them 
inaccessible for the spliceosome. This sequestration of splicing signals then 
triggers the usage of alternative 5’ splice sites, exon skipping, or intron reten- 
tion. As the base pairing sequences in the TPP riboswitch are part of the ligand 
binding pocket, binding of TPP leads to structural rearrangements, which render 
the splicing signals accessible. Subsequent downstream events then 
repress gene expression. This is achieved either by translational repression due 
to uORFs in the 5’ UTR or by triggering NMD via PTCs or the length of the 3’ 
UTR. 
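
The ligand-dependent outcomes described above for the fungal 5’ UTR and plant 3’ UTR riboswitches can be summarized as a small lookup table. The Python sketch below only restates the text. The keys, class name, and function name are hypothetical, and the Boolean `gene_expressed` is a qualitative shorthand for "expressed" versus "repressed".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    splicing_event: str
    consequence: str
    gene_expressed: bool

# (system, TPP level high?) -> outcome, as described in the text.
TPP_TABLE = {
    ("n_crassa_5utr", False): Outcome(
        "upstream 5' splice site used", "intron removed completely", True),
    ("n_crassa_5utr", True): Outcome(
        "downstream 5' splice site used", "retained sequence introduces a uORF", False),
    ("plant_3utr", False): Outcome(
        "5' splice site sequestered, intron retained",
        "stable mRNA with short 3' UTR", True),
    ("plant_3utr", True): Outcome(
        "intron removed together with major 3' processing site",
        "elongated 3' UTR induces NMD", False),
}

def tpp_response(system, tpp_high):
    """Return the outcome for one of the hypothetical system keys in TPP_TABLE."""
    return TPP_TABLE[(system, tpp_high)]

for system in ("n_crassa_5utr", "plant_3utr"):
    for tpp_high in (False, True):
        print(system, "TPP high" if tpp_high else "TPP low", "->", tpp_response(system, tpp_high))
```

Written this way, the common logic is easy to see: the downstream mechanisms differ, but high TPP always maps to repression.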

The TPP riboswitch in N. crassa located in the intron of the main ORF is an 
interesting exception (see preceding text). Here, the base pairing interactions in 
the absence of TPP do not regulate alternative splicing by 5’ splice site sequestra- 
tion, but facilitate intron removal [235]. This is achieved by a long-range interac- 
tion between the aptamer domain and several conserved nucleotides downstream 
of the 5’ splice site. It seems that, upon structure formation, reducing the effec- 
tive distance between the 5’ and 3’ splice sites enhances splicing efficiency. 

Until now, no eukaryotic homologs have been identified for the other ribos- 
witch classes discovered in bacteria and archaea. Still, the report of a putative 
arginine binding riboswitch, present in an intron in the 5’ UTR of an arginase in 
the fungus Aspergillus nidulans, suggests that other eukaryotic riboswitches 
might exist [238]. 


7.8 Splicing and Synthetic Biology 


7.8.1 Impact of Introns on Gene Expression 


Splicing is tightly linked with all stages of mRNA metabolism, including tran- 
scription, mRNA processing, nuclear export, and translation. Intron sequences 
may harbor transcriptional regulatory elements or affect DNA accessibility 
by determining nucleosome arrangement, influence export processes, mRNA 
stability, and the translatability of the mRNA (reviewed in [239]). As a 
consequence, the expression of a gene can change dramatically depending on the 
presence or absence of introns. 

Gene expression is usually reduced upon removal of endogenous introns [240]. 
The maintenance of introns can therefore lead to considerably higher protein expression, 
for example, in overexpression studies [241]. This is exemplified in Figure 7.4a, 
where the expression of the MAX (Myc associated factor X) protein from an 
intronless cDNA was compared with a cDNA retaining one endogenous intron. 

Also the heterologous expression of transgenes can be increased significantly 
by adding just a single generic intron [242-245]. The extent of the effect depends 
on intron identity, intron position within the gene, and the surrounding exonic 
sequences. Placing the same intron between different exons may yield opposing 
results [241, 246]. The insertion of intron 2 of the β-globin gene into a firefly 
luciferase reporter gene increased its expression 3-fold, whereas the insertion of a 
synthetic intron increased it only 1.5-fold (see Figure 7.4b). In contrast, insertion of β-globin 
intron 1 led to undetectable reporter activity. The inability of β-globin 
intron 1 to confer efficient splicing in a heterologous context is apparently due to 
its weak splicing signals and missing enhancing sequences in the artificial 
context [247]. The insertion of two short introns from the immunoglobulin heavy 
chain into both a green fluorescent protein (GFP) reporter gene and a Cre recom- 
binase cDNA increased gene expression up to 30-fold in CHO cells. These 
introns were chosen because they were short, compatible with high levels of gene 
expression, and without evidence of containing regulatory sequences [248]. In 
line with this approach, several commercially available expression vectors also 
contain short synthetic introns in their 5’ UTRs known to enhance the stability 
of the mRNA by influencing polyadenylation [249]. 
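
Fold changes such as those quoted for the luciferase reporters are typically obtained by normalizing the reporter signal to a co-transfected control and then relating each construct to the intronless reference. The Python sketch below shows this calculation for triplicate measurements, assuming Renilla luciferase as the transfection control. The function name and all readings are invented for illustration and are not data from the cited work.

```python
from statistics import mean

def fold_change(firefly, renilla, firefly_ref, renilla_ref):
    """Normalize firefly to Renilla signal per replicate, average the ratios,
    and express the result relative to the intronless reference construct."""
    ratio = mean(f / r for f, r in zip(firefly, renilla))
    ratio_ref = mean(f / r for f, r in zip(firefly_ref, renilla_ref))
    return ratio / ratio_ref

# Invented triplicate readings (arbitrary light units).
no_intron_firefly, no_intron_renilla = [1000, 1100, 900], [500, 550, 450]
with_intron_firefly, with_intron_renilla = [3000, 3300, 2700], [500, 550, 450]

fc = fold_change(with_intron_firefly, with_intron_renilla,
                 no_intron_firefly, no_intron_renilla)
print(f"fold change relative to intronless construct: {fc:.1f}")
```

Normalizing each replicate before averaging keeps well-to-well differences in transfection efficiency from inflating the apparent effect of the intron.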


7.8.2 Control of Splicing by Engineered RNA-Based Devices 


The controlled removal of intronic sequences offers the possibility to engineer 
user-defined gene expression systems. RNA-based control devices generally 
couple in vitro selected RNA aptamers as sensory domains to functional RNA 
domains (like a rbs, splice site, or a ribozyme). By modulating the accessibility of 
elements essential for splicing, such as the 5’ splice site, the branch point, or the 
3’ splice site, engineered riboswitches have been shown to control both constitu- 
tive and alternative splicing. 

In a pioneering study, a theophylline-binding aptamer was inserted close to a 
3’ splice site. The addition of the ligand theophylline resulted in a 4-fold reduc- 
tion of gene expression in an in vitro splicing assay [250, 251]. The data indicated 
that theophylline binding specifically blocked the recognition of the 3’ splice site. 
This aptamer was also used to modulate splicing efficiency by including the 
branch point sequence in the aptamer sequence [252]. In the presence of
theophylline, the downstream exon was skipped twice as often as in its absence,
indicating that engineered riboswitches can also modulate alternative splicing and can
therefore be used to investigate its impact.

A tetracycline-binding aptamer was used to regulate pre-mRNA splicing in
yeast [253]. The aptamer was inserted into a yeast intron in close proximity to
either the 5’ splice site or the branch point. Maximal regulation (16-fold) was
observed with a construct in which the 5’ splice site was masked by intramolecular
base pairing when placed within the closing stem of the aptamer. This blocked
the accessibility of the 5’ splice site to the U1 snRNP. The dynamic range of
regulation was increased by additionally inserting a second aptamer-containing intron.

Figure 7.4 Influence of introns on gene expression. (a) Western blot analyses after
overexpression of MAX protein either lacking (w/o) or containing one endogenous intron in
the ORF. Overexpression was performed in HeLa cells and protein was isolated after 24 h.
β-Actin was used as a loading control. (b) Firefly luciferase reporter gene constructs without
(w/o) intron or containing the β-globin introns 1, 2 or a synthetic intron. Firefly luciferase
activity was measured in triplicate 24 h after transfection of HeLa cells using Renilla luciferase
as transfection control. bgl = β-globin, n.d. = not determined.

Another programmable control device expands the possibilities for engineering
alternative splicing: it is triggered by the binding of a specific protein
to an aptamer located in the intronic sequence. This approach has been
successfully used to rewire both the Wnt and nuclear factor κB signaling pathways in
mammalian cells [254]. In plants, a naturally occurring rRNA-mimicking
structure was used to regulate cassette exon splicing in response to the expression of
a ribosomal protein. By using an engineered variant of the RNA structure from
another plant species, highly efficient, orthogonal gene activation could be
achieved in Nicotiana benthamiana [255].

Up to now, the number of artificially engineered systems used to control
pre-mRNA splicing is limited, but the examples presented demonstrate
that synthetic devices have great potential for controlling splicing and,
thus, both the level and the identity of target gene expression.


7.9 Conclusion 


Besides the importance of splicing for increasing proteome diversity, there is a 
clear impact of introns on gene expression levels with introns often stimulating 
but sometimes also reducing gene expression. There is no universal requirement 
for introns, but their presence has to be carefully considered during the de novo 
design of genetic pathways. Moreover, given the great importance of RNA splic- 
ing for gene regulation per se, RNA elements that target splicing may soon provide 
general and highly applicable platforms for engineering gene regulation systems.


Definitions


Splicing              removal of intronic sequences from the pre-mRNA
Exon                  sequences of a gene included in the mature mRNA
Intron                intervening sequences removed upon splicing
Spliceosome           machinery that removes introns from the pre-mRNA
Alternative splicing  generation of multiple mature mRNA molecules from a single gene


8.1 Utility of the RNAi Pathway for Application
in Mammalian Cells


RNA interference (RNAi) is an efficient and convenient tool for transient gene 
suppression (knockdown) in biomedical research. RNAi is beneficial for genetic 
screening and basic studies involving loss-of-function phenotypes and as an 
alternative protein inhibitor to small molecule drugs [1]. Since the first discovery 
of the RNAi phenomenon in Caenorhabditis elegans [2], intensive genetic and
biochemical research has uncovered the molecular mechanisms underlying 
RNAi and identified analogous pathways and molecules to control RNAi in 
eukaryotes [3-5]. 

In mammalian systems, RNAi is induced when microRNA (miRNA), short
hairpin RNA (shRNA), or small interfering RNA (siRNA) harness the endogenous
processing pathway and machinery (Figure 8.1). In the endogenous RNAi
pathway, primary miRNA (pri-miRNA) embedded in coding or noncoding RNA
is transcribed from genomic or plasmid DNA by RNA polymerase II or III and is
cleaved at the base region of the stem-loop structure (two black wedges) by the
RNase III nuclease Drosha. The cleaved stem-loop precursor miRNA
(pre-miRNA) is recognized by Exportin-5, exported from the nucleus to the
cytoplasm, and processed into mature miRNA (two black wedges) by another
RNase III nuclease, Dicer. Then, one strand of the mature miRNA is selected and
loaded into Ago2 to activate sequence-specific mRNA degradation and
targeted gene repression. shRNA expressed from plasmids is exported to the
cytoplasm and processed only by Dicer, in a manner similar to transfected shRNA
or siRNA molecules (Figure 8.1).
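
The entry points described in this paragraph can be summarized in a small data model. The sketch below is only an illustration of the order of processing steps as stated in the text; the step labels, the `ENTRY_STEP` table, and both function names are hypothetical and are not taken from any cited work.

```python
# Processing steps of the mammalian RNAi pathway, in the order given in the text.
STEPS = ["Drosha cleavage", "nuclear export", "Dicer processing", "Ago2 loading"]

# First step that each RNA species still has to pass (labels are illustrative).
ENTRY_STEP = {
    "pri-miRNA": "Drosha cleavage",          # transcribed in the nucleus
    "plasmid shRNA": "nuclear export",       # bypasses Drosha
    "transfected shRNA": "Dicer processing", # delivered to the cytoplasm
    "transfected siRNA": "Dicer processing", # treated like shRNA, as in the text
}


def remaining_steps(species):
    """Return the processing steps an RNA species must pass before silencing."""
    start = STEPS.index(ENTRY_STEP[species])
    return STEPS[start:]


def is_blocked_by(species, inhibited_step):
    """True if inhibiting one step would stop this species from acting."""
    return inhibited_step in remaining_steps(species)


if __name__ == "__main__":
    for name in ENTRY_STEP:
        print(name, "->", " -> ".join(remaining_steps(name)))
```

Under this simplified view, a switch that masks the Drosha recognition site can only regulate pri-miRNA-based devices, whereas a switch that masks the Dicer recognition site can regulate every species listed.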

From a synthetic biology perspective, RNAi is a suitable and potent technology
for the development of genetic devices to rewire cell signaling. It is important to
generate RNAi-modulated genetic devices that detect target input molecules
(triggers) and control RNAi-mediated gene expression (referred to as RNAi
switches). To obtain such RNAi switches, appropriate RNA sequences located in
(pri)(pre)-miRNAs or shRNAs have been engineered to bind various triggers and
thereby modulate recognition by RNA-processing nucleases such as Drosha and
Dicer, so that trigger binding inhibits or permits nuclease processing. To design
trigger molecule-controlled RNAi switches, it is useful to isolate functional
RNA modules based on RNA secondary structures, because RNA can often be divided
into functional modules that are reassembled through double-stranded regions
without disrupting the original function. Synthetic RNAi switches have been
developed by employing various trigger molecules (e.g., small molecules, RNA,
or proteins) that take advantage of the modularity of RNA.

Figure 8.1 Schematic of the RNAi pathway in mammalian cells, including endogenous miRNA
processing and ectopic shRNA or siRNA expression.


8.2 Development of RNAi Switches that Respond 
to Trigger Molecules 


Control of gene expression from exogenous DNA by a set of transcription factors 
and coupled small molecules has conventionally been used for conditional 
expression strategies [6-8]. Similarly, the transcriptional control of shRNA 
expression using small molecules has been employed for the construction of
tunable genetic switches based on RNAi [9]. This system has also combined the Lac
repressor with shRNA-mediated RNAi to synergistically suppress target gene transcription
and translation.




Hereafter, we will focus on an RNA design strategy and the posttranscriptional 
gene expression control of RNAi switches via several trigger molecules including 
small molecules, oligonucleotides, and proteins (Figure 8.2). These triggers bind 
to specific RNA sequences, and the interactions between them can be employed 
to generate RNAi switches (Table 8.1). The advantages and potential applications 
of RNAi switches primarily depend on the type of trigger. Small molecule trig- 
gers that penetrate through the cell membrane tune the function of RNAi 
switches by adjusting the extracellular concentration of the input molecules, 
which is a mechanism similar to that of small molecule-inducible transcription 
factors. Oligonucleotide triggers, such as DNA, RNA, and modified
oligonucleotides (MONs), form Watson-Crick base pairs with designed RNA
devices, so the specificity and affinity between the trigger molecules
and the devices can be adjusted through sequence design. Protein triggers can also be
used to control the functions of RNAi switches. Because such proteins are expressed
in specific cells, these switches can distinguish target
cell types based on the intracellular environment.


8.2.1 Small Molecule-Triggered RNAi Switches 


Small molecule-triggered RNAi switches have been designed to modulate Dicer
or Drosha processing of shRNA or pri-miRNA. Initially, three different switch
design strategies implementing a theophylline aptamer were employed to achieve
theophylline-responsive properties [20]. In the first design approach to obtain
theophylline-responsive shRNA switches, the loop region of EGFP- (or DsRed-)
targeting shRNA was replaced with a theophylline aptamer containing a loop
sequence; this replacement was designed to create a theophylline-RNA
complex around the Dicer recognition site [10]. When expressed in HEK293 cells, the
switches inhibited Dicer processing and knockdown of reporter fluorescent
genes in the presence of theophylline in the culture medium. A similar approach was
applied to the development of pri-miRNA-based RNAi switches (pri-miRNA
switches), in which the theophylline aptamer was introduced into the Drosha
recognition site of pri-miRNA [14].

The second strategy was based on changing the relative stability of two reversible
RNA conformational states (active or inactive) via trigger (theophylline or
hypoxanthine) binding [11]. In the absence of triggers, the shRNA switch formed the
canonical dsRNA structure (active state) that is required for EGFP knockdown; the
binding of the trigger changed the secondary structure of the switch, so that part of
one dsRNA strand stably bound to the adjacent loop sequence (inactive state) and
disrupted the canonical dsRNA structure.

The third strategy employed an irreversible conformational change of the
pri-miRNA structure and ligand-controlled hammerhead ribozymes [12]. In the
absence of theophylline, the allosteric hammerhead ribozyme domain and the
following inhibitory strand, which hybridizes with the pri-miRNA, disrupted the
canonical structure that is required for Drosha processing. In the presence of
theophylline, ribozyme self-cleavage exposed the 5’ single-stranded region that was
originally masked by the inhibitory strand, resulting in Drosha processing.

Another design strategy considered endogenous pre-miRNAs as potential RNAi
switches. In this system, Dicer or Drosha processing was inhibited by bioactive
small molecules that target pre-miRNAs [19]. The strategy took a tightly binding
pair of a benzimidazole and an RNA internal loop motif from a database of RNA
motif-small molecule interactions and searched the Dicer or Drosha recognition
sites of disease-related pre-miRNAs for this internal loop motif [19]. The motif was
found in, and fit well with, the Drosha recognition site of human pre-miR-96. As a
result, the benzimidazole specifically inhibited maturation of endogenous
pre-miR-96, restored expression of the downstream proapoptotic gene FOXO1, and
induced apoptosis in MCF7 cells.

Figure 8.2 RNAi switch design strategies with a variety of trigger molecules. The RNA motifs
that bind to specific trigger molecules are introduced into the appropriate regions in
pri-miRNA, pre-miRNA, shRNA, or siRNA. The motifs embedded in the RNA are then optimized
to generate functional RNAi switches.


8.2.2 Oligonucleotide-Triggered RNAi Switches 


Oligonucleotide-triggered RNAi switches have been designed to modulate Dicer 
processing of siRNA or Drosha processing of pri-miRNA. The strategy for 
designing the switches is based on toehold-mediated oligonucleotide displace- 
ment. DNA-mediated siRNA switches are composed of two DNA-RNA hybrids 
[18]. The RNA strands of the hybrids are split sense and antisense strands 
of siRNA. The DNA strands contain additional nucleotides (toeholds) in the 
hybridization region, and the two DNA strands can potentially form a double 
strand. The initial DNA-RNA hybrid is not processed by Dicer because Dicer cannot
recognize a DNA-RNA hybrid [21]. After the DNA-RNA hybrids are
transfected into mammalian MDA-MB-231 cells, the DNA strands bind to each other at the
toehold region and displace the RNA strands, producing double-stranded
DNA and siRNA. The siRNA is then processed by Dicer to knock down target
genes.
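
The complementarity relationships behind this split design can be checked with a few lines of code. The following is a minimal sketch under stated assumptions: the placement of the toehold at the 3’ end of one DNA strand, the example sequences, and all function names are hypothetical choices made for illustration and do not reproduce the published constructs [18].

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(seq, table):
    """Reverse complement of a 5'->3' sequence using the given base table."""
    return "".join(table[base] for base in reversed(seq))


def rna_to_dna(seq):
    """Write an RNA sequence in the DNA alphabet (U -> T)."""
    return seq.replace("U", "T")


def design_hybrids(sense_rna, toehold_dna):
    """Build two DNA strands that pair with the split siRNA strands.

    Assumption: the toehold is appended to the 3' end of the DNA strand that
    pairs with the sense RNA; the second DNA strand is its full complement, so
    the two DNA strands can form a duplex and release both RNA strands.
    """
    antisense_rna = reverse_complement(sense_rna, RNA_COMPLEMENT)
    dna_for_sense = rna_to_dna(antisense_rna) + toehold_dna
    dna_for_antisense = reverse_complement(dna_for_sense, DNA_COMPLEMENT)
    return {
        "antisense_rna": antisense_rna,
        "dna_for_sense": dna_for_sense,
        "dna_for_antisense": dna_for_antisense,
    }


if __name__ == "__main__":
    # Made-up 19-nt sense strand and 12-nt toehold, for illustration only.
    hybrids = design_hybrids("GCAAGCUGACCCUGAAGUU", "TTGACCGATACG")
    for name, seq in hybrids.items():
        print(f"{name}: 5'-{seq}-3'")
```

Because the two DNA strands are complementary over their full length, including the toeholds, the DNA duplex contains more base pairs than either DNA-RNA hybrid, which is the usual rationale for toehold-mediated displacement.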

A small RNA-triggered siRNA switch was designed to block duplex formation
between the sense and antisense RNA strands in the absence of a
trigger. For this, an inhibitory sequence is connected to the 3’ end of the sense
strand and is partially hybridized with the antisense strand-binding region of the
sense strand [13]. When present, the trigger RNA hybridizes with the
inhibitory sequence, allowing complete hybridization of the sense and antisense
strands. The resulting dsRNA is processed as active siRNA by Dicer (OFF-to-ON
RNAi switch). MON-triggered pri-miRNA switches were also designed by
introducing an inhibitory RNA stem-loop at the 3’ end of the pri-miRNA to conceal
the single-stranded region of the Drosha recognition site [15]. The MON targets
endogenous genes, and the pri-miRNA contains a target sequence for a fluorescent
protein (EGFP or DsRed) in EGFP- or DsRed-expressing HeLa cells. When
present, the MON hybridizes perfectly with half of the inhibitory RNA stem-loop
sequence, resulting in an RNA conformational change and exposure of the
single-stranded region that is recognized by Drosha. The resulting dsRNA is also
processed as an siRNA by Dicer.


8.2.3 Protein-Triggered RNAi Switches 


Protein-triggered RNAi switches have been designed by replacing the loop region
of shRNA with protein-binding sequences in order to mask the Dicer
recognition site in the presence of trigger protein molecules [16, 17]. A specific
and tight RNA-protein interaction (RNP) motif is important when designing an
efficient RNAi switch. An RNP motif consisting of an archaeal
ribosomal protein, L7Ae, and its binding partner, box C/D kink-turn RNA (Kt),
was employed to develop L7Ae-triggered shRNA switches (Kt-shRNA) [16]. L7Ae
binds to the loop region of Kt-shRNA, which inhibits Dicer cleavage and target
gene knockdown. Following Kt-shRNA development, shRNA switches
triggered by the human splicing-related protein U1A and the human
transcriptional regulator NFκB (p50 domain) were developed [17]. The RNP motifs of the
U1A protein and the loop sequence of U1 snRNA or the loop-stem-loop sequence
in the 3’ untranslated region of U1A mRNA were utilized to develop two types of
U1A-triggered shRNA switches. An RNP motif composed of the NFκB protein
and an artificially selected NFκB-binding aptamer was also employed to develop
an NFκB-triggered shRNA switch. The molecular structures of these RNP motifs
were solved via crystal or nuclear magnetic resonance (NMR) structural analyses
and utilized to create three-dimensional (3D) molecular designs of the switches.
The switches were designed by incorporating a protein-binding sequence into
the loop region of shRNA, which contains 22-28 bp of dsRNA targeting the
EGFP gene in the stem region. The configurations of these switches were
three-dimensionally optimized such that the interaction between the trigger protein
and shRNA efficiently blocked Dicer processing.


8.3 Rational Design of Functional RNAi Switches 


Rational and predictable RNA design strategies are critical for developing
versatile RNAi switch systems. Hereafter, we will focus on design strategies for RNAi
switches. The most common design strategy for RNA switches utilizes predicted
RNA secondary structures and their free energies based on Watson-Crick base
pairing in the presence/absence of trigger molecules. This strategy also attempts
to optimize the free energy difference between the two states by changing base
pair lengths and introducing mutations [11, 15, 22-25]. Several ligand- (e.g.,
small molecule or oligonucleotide) responsive RNAi switches have been designed
based on this strategy. Once the secondary structure, free energy difference, and
base pair length have been optimized, a similar design strategy could be applied
to generate various RNAi switches that respond to different trigger molecules.
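
To see why the free energy difference between the two states matters, it helps to treat the switch as a simple two-state equilibrium. The sketch below is an illustrative thermodynamic model, not the method used in the cited studies: it assumes that the ligand binds only the inactive conformation, and the function name, parameter names, and numerical values are hypothetical.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872  # kcal/(mol*K)


def fraction_active(dg_active_minus_inactive, ligand=0.0, kd=1.0, temp_k=310.15):
    """Equilibrium fraction of switch molecules in the active conformation.

    dg_active_minus_inactive: free energy of the active state minus that of
        the inactive state without ligand (kcal/mol); negative favors active.
    ligand, kd: ligand concentration and dissociation constant (same units).
    Assumption: the ligand binds only the inactive state, which lowers its
    free energy by RT*ln(1 + ligand/kd).
    """
    rt = GAS_CONSTANT_KCAL * temp_k
    ligand_stabilization = -rt * math.log(1.0 + ligand / kd)
    dg = dg_active_minus_inactive - ligand_stabilization
    return 1.0 / (1.0 + math.exp(dg / rt))


if __name__ == "__main__":
    for dg0 in (-3.0, -1.0, 0.0):
        without = fraction_active(dg0)
        with_ligand = fraction_active(dg0, ligand=100.0, kd=1.0)
        print(f"dG = {dg0:+.1f} kcal/mol: active fraction "
              f"{without:.3f} -> {with_ligand:.3f} with ligand")
```

In this model, a free energy difference that is too strongly in favor of the active state cannot be overcome by ligand binding, while a difference close to zero leaves much of the switch inactive even without ligand. Tuning base pair lengths or introducing mutations shifts the switch between these two regimes.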

A useful and efficient 3D design approach has been utilized to develop
protein-triggered RNAi switches by employing available 3D RNP structures (analyzed via
both NMR and X-ray crystallography) [17, 26]. First, the structural
components of the shRNA switches were three-dimensionally reconstructed in
silico by creating 22-28 bp of A-form dsRNA with 3D molecular design software
and loading the RNP motif, composed of the trigger protein and its binding RNA
motif, from the Protein Data Bank (Figure 8.3a, left). Then, 3D structural models
of the trigger protein-bound shRNA switches were constructed by superimposing
the few terminal nucleotides of the RNA loop on the dsRNA using a least-squares
minimization method and connecting the loop with the dsRNA. The models predicted
the structural states
of the shRNA switches in the presence of the trigger protein (Figure 8.3a, right).
As described in Figure 8.3, with a 1-bp insertion the bound trigger protein on the
shRNA switch rotates approximately 30° in a counterclockwise direction around the
axis of the dsRNA and is located ~2.6 Å farther from the site of Dicer
cleavage.
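
The geometric effect of changing the stem length can be estimated from the two figures quoted above (about 30° of rotation and about 2.6 Å of displacement per inserted base pair). The sketch below is a back-of-the-envelope calculation only; the reference stem length, the "facing" angle, and the tolerance are hypothetical parameters, and a real design would rely on the 3D models described in the text.

```python
ROTATION_PER_BP_DEG = 30.0   # approximate rotation per inserted base pair (from the text)
RISE_PER_BP_ANGSTROM = 2.6   # approximate displacement per inserted base pair (from the text)


def protein_position(inserted_bp):
    """Rotation (degrees, 0-360) and axial shift (angstrom) after inserting base pairs."""
    angle = (inserted_bp * ROTATION_PER_BP_DEG) % 360.0
    shift = inserted_bp * RISE_PER_BP_ANGSTROM
    return angle, shift


def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def facing_stem_lengths(reference_bp, facing_angle_deg, tolerance_deg,
                        min_bp=22, max_bp=28):
    """Stem lengths whose bound protein points within a tolerance of a target angle.

    reference_bp is the stem length defined as 0 degrees; facing_angle_deg is a
    hypothetical orientation at which the protein would clash with Dicer.
    """
    hits = []
    for bp in range(min_bp, max_bp + 1):
        angle, _ = protein_position(bp - reference_bp)
        if angular_distance(angle, facing_angle_deg) <= tolerance_deg:
            hits.append(bp)
    return hits


if __name__ == "__main__":
    for extra in range(0, 7):
        angle, shift = protein_position(extra)
        print(f"+{extra} bp: rotated {angle:.0f} deg, shifted {shift:.1f} A")
```

With these numbers, adding or removing a single base pair is enough to turn the bound protein by roughly one twelfth of a full turn, which is why base pair length is used as a design variable.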

Because the Dicer enzyme can access the 22nd nucleotides from both the 
5’ and 3’ ends [27], the bound trigger protein on the switches was designed to 
block Dicer access. Specifically, the base pair lengths of the switches were 
adjusted by taking advantage of the orientation change of each base pair such 
that the bound trigger proteins could block Dicer access (Figure 8.3b). To predict 
in silico the collision between Dicer and the bound trigger protein, the con- 
structed switch models were superimposed on the catalytic sites and the periph- 
eral region of Giardia Dicer with reference to the Dicer cleavage sites and 
catalytic sites. Based on the results of the 3D molecular design and switch assess- 
ment, steric hindrance between Dicer and the shRNA-bound protein was pre- 
dicted in silico, which positively correlated with the inhibition of Dicer cleavage 
in vitro and target gene expression in living cells. Furthermore, the 3D molecular 
design method could be applied to all switches that sense different RNA-binding
proteins (e.g., L7Ae, U1A, and NFκB) and could be used to predict the
functions of these proteins. In principle, the strategy could predict functional 
switch structures in response to RNA-binding proteins to adjust the ON/OFF 
ratio of the designed switches. 


8.4 Application of the RNAi Switches 


RNAi switches have been proposed for applications including drug delivery,
RNAi reporters, conditional knockdown, and cell fate control (Figure 8.4). For
example, DNA-mediated siRNA switches consist of DNA-RNA hybrids that
may be suitable for the systemic delivery of siRNA. In vivo (mouse) studies
have demonstrated that these switches show degradation resistance in the
bloodstream, efficient uptake by tumors, and reassociation-triggered
activation of split siRNA function [18]. MON-triggered pri-miRNA switches can
be applied as reporters of activated siRNA molecules in individual cells. Equal
amounts of MON (including siRNA and pre-miRNA targeting fluorescent
proteins) are produced in the nucleus when Drosha cleaves MON-bound
pri-miRNA switches [15]. The levels of RNAi and siRNA molecules can thus be
more precisely monitored and visualized than with co-transfection of target
and reference (fluorescent protein target) siRNA. Protein-triggered shRNA
switches can be applied to control cell fate. These switches
can respond to human U1A and NFκB protein expression within cells [26].
L7Ae-triggered shRNA switches were shown to control human cell fate by
regulating the balance between proapoptotic (Bim) and antiapoptotic (Bcl-xL)
protein molecules via the knockdown of the antiapoptotic protein (Bcl-xL) and thereby
determining the status of mitochondria-dependent apoptosis pathways. The
expression of L7Ae thus determines cell survival.

Figure 8.3 (caption fragment) ... than switch Y. (b) Dicer can access and process switch Y
via trigger protein binding; the switch then induces the knockdown of its target gene via
RNAi (left). Switch X is inaccessible to Dicer in the presence of the trigger protein because
the RNP interaction faces Dicer and inhibits its access (right). The prevention of Dicer
function causes the derepression of gene ...

Figure 8.4 Applications of RNAi switches. RNAi ON/OFF switches can be applied to RNAi
reporter and cell fate conversion systems that respond to specific trigger molecules.


8.5 Future Perspectives 


Recent intensive research has resulted in the development of RNAi switches that 
are triggered by multiple chemicals and biomacromolecules. To improve the 
ability of RNAi switches to rewire gene regulatory networks, however, there 
are several challenges to overcome regarding switch efficiency and the variety 
of specific trigger molecules for other RNA switches. For example, protein- 
triggered shRNA switches require high plasmid expression levels of the trigger 
protein. To generate switches that respond to endogenous protein molecules, 
designed RNA devices must efficiently and selectively detect target proteins [28]. 
Extra signal amplification systems such as synthetic positive feedback loops may 
be required to generate sufficient protein signals. Additionally, an orthogonal 
RNA-protein-binding pair that does not interfere with natural RNA or protein 
molecules is desirable to sense target proteins without inducing side effects. 
Thus, it is important to develop an automated and easy selection method to 
generate such specific RNA-protein-binding pairs from RNA motif libraries.


9.1 Introduction


Engineering small molecule-responsive RNA switches in bacteria was motivated 
by the discovery of natural prokaryotic riboswitches in 2002 [1-3]. These 
endogenous RNA cis-regulatory elements are usually found in the 5’ untrans- 
lated region (UTR) of prokaryotic mRNAs and modulate gene expression in 
response to various metabolites [4]. A typical riboswitch contains an aptamer 
domain that is responsible for metabolite binding and an expression platform 
that facilitates a ligand-dependent structural change that influences gene 
expression. For example, a metabolite-mediated structural change may alter the 
accessibility of the ribosome binding site (RBS), which results in a change in 
translation efficiency, or dictate the formation of a transcription terminator 
structure (a stem-loop followed by a short poly(U) tract) that results in prema- 
ture termination of the transcript. The ability of these riboswitches to control 
gene expression in response to small molecules of biological or synthetic origin 
can be very useful in synthetic biology and metabolic engineering. This section 
provides a brief overview of the previous major efforts to engineer small 
molecule-responsive RNA switches in bacteria. 


9.2 Design Strategies 


9.2.1 Aptamers 


An RNA aptamer that specifically binds to a desired small molecule ligand is a 
prerequisite to engineering riboswitches. Most published synthetic riboswitches 
have used known aptamers selected in vitro or metabolite-binding aptamers 
found in natural riboswitches. At least one group has performed in vitro selection 
to develop novel aptamers specifically for riboswitch applications in bacteria [5]. 
While researchers have successfully performed in vitro selection to discover
aptamers against numerous small molecules [6], isolating new aptamers for
novel targets for synthetic riboswitch applications is likely to remain a
challenge [7]. Although in vitro selection often yields aptamers with
respectable affinity and specificity, a major challenge when using them in the cellular
context is the difficulty in predicting how the affinity, stability, or folding of the
aptamers is altered inside the cells. To mitigate such uncertainty, Gallivan and
coworkers used a pool of affinity-enriched aptamers from an in vitro selection,
rather than a few isolated aptamer clones, in their effort to engineer riboswitches
that respond to an herbicide in Escherichia coli [5].

In another notable effort, Dixon and coworkers conducted an in vivo screen to 
modify a natural aptamer to recognize an alternative synthetic analog [8]. 
Although this strategy is likely to be limited to engineering aptamers for ligands
that are structurally similar to an existing ligand, it represents a viable alternative
route to obtain a set of orthogonal riboswitches.


9.2.2 Screening and Genetic Selection


With an aptamer in hand, designing an appropriate expression platform becomes 
the primary challenge in engineering small molecule-responsive riboswitches. 
By far, the most successful strategies have employed some form of medium- to 
high-throughput screening or genetic selection at this stage to discover func- 
tional riboswitches with desired characteristics. Generally, a short stretch of 
sequence near an aptamer embedded in the 5’ UTR is randomized with an antic- 
ipation that a subset of those sequences will function as an expression platform. 
This pool of riboswitch mutants is subjected to suitable screening or selection 
steps to enrich functional riboswitches and to eventually isolate individual 
clones. 
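
The size of a randomized library grows quickly with the number of randomized positions, which is why the number of transformants becomes limiting. The following sketch illustrates the arithmetic; the coverage estimate assumes that every sequence is sampled with equal probability (a Poisson approximation), and the function names and example numbers are hypothetical.

```python
import math


def library_size(n_randomized):
    """Number of distinct sequences when n positions are fully randomized (A, C, G, U)."""
    return 4 ** n_randomized


def expected_coverage(n_randomized, n_transformants):
    """Expected fraction of distinct sequences present at least once.

    Assumes uniform, independent sampling of sequences among the transformants.
    """
    size = library_size(n_randomized)
    return 1.0 - math.exp(-n_transformants / size)


if __name__ == "__main__":
    transformants = 1_000_000  # hypothetical number of transformed cells
    for n in (5, 8, 10, 12):
        print(f"{n} randomized nt: {library_size(n):>10,} variants, "
              f"coverage {expected_coverage(n, transformants):.1%}")
```

Under these assumptions, a short randomized stretch can be sampled essentially completely, whereas longer stretches are only partially covered by a single transformation.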

Genetic selection enables rapid enrichment of potential riboswitches from a 
large population (>10°) of mutants by coupling the survival or growth of the 
bacteria to the expression of functional riboswitch mutants. Nomura and
Yokobayashi isolated riboswitches entirely through genetic selection for the first
time [9]. This was achieved by the use of the tetracycline antiporter (tetA) as a
selection marker to enable both ON and OFF selection. In this system, ON cells are
selected using tetracycline and OFF cells are selected using NiCl2 added to the
culture media [10]. The group later improved the method by adding a
fluorescent reporter gene (GFPuv) as a translational fusion to TetA to enable rapid
screening of the genetically selected mutants [11].
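
The logic of this dual selection can be written down as a simple filter. The sketch below is a toy model rather than a description of the published protocol: it treats survival as a hard threshold on tetA expression, and the thresholds, library values, and function name are hypothetical.

```python
def dual_selection(library, on_threshold, off_threshold, ligand_activates=True):
    """Return the names of mutants that survive both selection steps.

    library: list of (name, expression_without_ligand, expression_with_ligand),
        where the values stand for tetA expression in arbitrary units.
    ON selection (tetracycline): cells survive if tetA expression is high.
    OFF selection (NiCl2): cells survive if tetA expression is low.
    For a ligand-activated switch, ON selection is done with ligand and OFF
    selection without ligand; for a ligand-repressed switch, the reverse.
    """
    survivors = []
    for name, without_ligand, with_ligand in library:
        if ligand_activates:
            on_level, off_level = with_ligand, without_ligand
        else:
            on_level, off_level = without_ligand, with_ligand
        if on_level >= on_threshold and off_level <= off_threshold:
            survivors.append(name)
    return survivors


if __name__ == "__main__":
    # Hypothetical mutants: (name, expression without ligand, expression with ligand)
    mutants = [
        ("always_on", 90, 95),
        ("always_off", 5, 6),
        ("on_switch", 8, 80),
        ("off_switch", 85, 7),
    ]
    print(dual_selection(mutants, on_threshold=50, off_threshold=10))
    print(dual_selection(mutants, 50, 10, ligand_activates=False))
```

Constitutively high and constitutively low mutants each fail one of the two steps, so only mutants whose expression changes with the ligand pass both.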

Alternatively, Topp and Gallivan devised a selection strategy based on cell 
motility by coupling the riboswitch output with the expression of cheZ, which 
confers cell motility when expressed in a ΔcheZ host [12]. In this method, cells
are physically isolated on a semisoft agar plate based on their motility. 

Although genetic selection enables examination of a relatively large number of
mutants, limited primarily by the transformation efficiency, it is often difficult to
fine-tune the selection pressures to engineer devices with precise
characteristics. A complementary strategy is to employ a reporter gene such as green
fluorescent protein (GFP) and quantitatively measure the riboswitch performance
of individual mutants. The Gallivan and the Hartig groups, among others, have
taken this approach by evaluating hundreds to thousands of riboswitch clones
by reporter gene assay [13-15]. Although more laborious and costly compared
with genetic selection, screening of individual clones provides quantitative
characteristics of every mutant evaluated (ON and OFF expression levels).
More recently, fluorescence-activated cell sorting (FACS) has been used to
further increase throughput [16, 17].
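
Screening data of this kind are usually reduced to an ON/OFF ratio per clone. A minimal sketch of that calculation is shown below; the background subtraction, the clone names, and the fluorescence values are hypothetical and serve only to illustrate the bookkeeping.

```python
def on_off_ratio(on_signal, off_signal, background=0.0):
    """Background-corrected ratio of ON to OFF reporter signal."""
    corrected_off = off_signal - background
    if corrected_off <= 0:
        raise ValueError("OFF signal must exceed background to compute a ratio")
    return (on_signal - background) / corrected_off


def rank_clones(measurements, background=0.0):
    """Sort clones by ON/OFF ratio, best first.

    measurements: dict mapping clone name -> (on_signal, off_signal).
    Returns a list of (name, ratio) tuples.
    """
    ratios = [
        (name, on_off_ratio(on, off, background))
        for name, (on, off) in measurements.items()
    ]
    return sorted(ratios, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fluorescence readings (arbitrary units)
    clones = {"clone_A": (650, 100), "clone_B": (900, 450), "clone_C": (300, 30)}
    for name, ratio in rank_clones(clones):
        print(f"{name}: ON/OFF = {ratio:.1f}")
```

Keeping the ON and OFF levels alongside the ratio is useful, because two clones with the same ratio can differ widely in absolute expression.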


9.2.3 Rational Design


Despite the extensive research on natural riboswitch mechanisms and
structures, examples of rational or computational design of synthetic bacterial
riboswitches have been few and far between. An earlier example by Suess et al. highlighted the
potential of rationally engineering ligand-induced structural shift to regulate 
bacterial gene expression [18]. More recently, computationally driven designs of 
bacterial RNA switches based on ribozymes [19] and transcriptional regulation 
[20] have emerged. However, some level of experimental feedback is still expected 
to be essential due to the complexity of parameters that influence the perfor- 
mance of these RNA devices in living cells. 


9.3 Mechanisms


9.3.1 Translational Regulation 


Translational regulation by bacterial riboswitches involves a change in the 
local structure of the ribosome binding site (RBS) upon ligand binding. Sequestration of the RBS
within a stable structure generally hinders ribosome access and results in
repressed translation. Engineering such ligand-induced structural changes, 
however, is not trivial, and screening or selection is often used in the process. 
In some cases, naive randomization of the nucleotides peripheral to the RBS 
followed by screening or selection was sufficient for isolation of suitable 
expression platforms [9, 13]. In other cases, riboswitch libraries were carefully 
designed to predispose the riboswitch mutants to undergo a specific structural 
shift. An example of the latter is shown in Figure 9.1a where the RBS was stra- 
tegically placed to form a putative stem at the base of the aptamer upon ligand 
binding so that the riboswitch negatively responds to the aptamer ligand [21]. 

In another strategy, the RBS was placed so that it becomes accessible only 
when the hammerhead ribozyme self-cleaves, and the aptamer was inserted 
in one of the stem-loops of the ribozyme to control its activity (Figure 9.1b) 
[14, 15, 22]. In this strategy, because the translation efficiency is directly coupled
to the ribozyme activity, the small molecule response is actually engineered at the
level of the aptamer-ribozyme hybrid, or aptazyme. The Hartig group has
exploited the aptazyme strategy further by adapting aptazymes to control other
translational components such as tRNA [23] and rRNA [24] to construct small
molecule-responsive RNA switches in E. coli.

Figure 9.1 (a,b) Examples of synthetic riboswitch libraries.


9.3.2 Transcriptional Regulation 


Relative to translationally regulated riboswitches, transcriptional regulation has 
not been extensively exploited in engineered riboswitches. Although this could 
be partly due to the complex mechanisms observed in the transcriptionally regu- 
lated natural riboswitches that involve folding kinetics of multiple RNA elements 
[25], several recent publications indicate that it is possible to engineer such 
riboswitches. Wachsmuth et al. used computational tools to rationally design a 
transcriptionally regulated riboswitch using a theophylline aptamer [20]. After 
some iterative improvements, they obtained a riboswitch with a respectable ON/OFF
ratio of 6.5 in E. coli.

Alternatively, Qi and coworkers engineered trans-acting small noncoding
RNAs (ncRNAs) that function by transcriptionally regulating the target gene
[26]. Their system is based on the antisense RNA-mediated transcription
attenuation observed in the staphylococcal plasmid pT181 [27, 28]. By
strategically fusing an aptamer and the ncRNA in tandem and screening
mutants in E. coli, the group successfully isolated small molecule-regulated
trans-acting RNA switches.

More recently, Ceres and coworkers discovered that certain expression plat- 
forms of transcriptionally regulated riboswitches can accommodate different 
natural and synthetic aptamers without losing the gene regulatory function [29]. 
They were also able to qualitatively tune the device characteristics by adjusting 
the strength of the key stem sequences in a predictable fashion. 




9.4 Complex Riboswitches 


Although the majority of natural riboswitches provide a simple yet efficient
means of metabolite-controlled gene expression, a few noncanonical riboswitches
that contain two aptamers have been reported [30-32]. These riboswitches have
been shown or predicted to exhibit more complex functions than
single-aptamer riboswitches, such as a cooperative response to a ligand [30] or
Boolean logic response to two distinct metabolites [31]. 

Several synthetic mimics of these complex bacterial riboswitches have been 
constructed. Sharma et al. constructed riboswitches that function as AND and 
NAND logic gates in response to theophylline and thiamine pyrophosphate 
(TPP) by tetA genetic selection [33]. More recently, Muranaka and Yokobayashi 
combined two independently optimized TPP riboswitches into the same 5’ UTR 
to construct a “band-pass” riboswitch that activates gene expression within a 
limited range of the ligand concentration [34]. 
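One way to picture a band-pass response is as the product of an activating and a repressing dose-response curve with different thresholds. The sketch below is only an illustration under that assumption: the Hill-function form, the function names, and the parameter values (k_on, k_off, n) are hypothetical and are not taken from the cited study.

```python
def hill_activation(ligand, k_half, n=2.0):
    """Fractional output of a switch that turns ON as ligand increases."""
    return ligand**n / (k_half**n + ligand**n)


def hill_repression(ligand, k_half, n=2.0):
    """Fractional output of a switch that turns OFF as ligand increases."""
    return k_half**n / (k_half**n + ligand**n)


def band_pass(ligand, k_on=1.0, k_off=100.0, n=2.0):
    """Assumed multiplicative combination of an ON and an OFF switch.

    Output is high only when ligand is well above k_on and well below k_off
    (arbitrary concentration units, hypothetical thresholds).
    """
    return hill_activation(ligand, k_on, n) * hill_repression(ligand, k_off, n)


if __name__ == "__main__":
    for conc in (0.01, 0.1, 1, 10, 100, 1000, 10000):
        print(f"{conc:>8g}  {band_pass(conc):.3f}")
```

With these illustrative thresholds, the output is low at both very low and very high ligand concentrations and peaks in between; shifting k_on or k_off moves the edges of the pass band.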

In an alternative approach, Klauser et al. recently combined multiple ribozyme- 
based switches to create logic gates [35]. Qi et al. co-expressed two allosteric 
trans-acting ncRNA regulators to demonstrate a NOR logic gate with a small 
molecule and a protein as inputs [26]. 
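The logic functions named above can be summarized as truth tables. The following sketch treats each input as simply present or absent and the output as gene expression ON or OFF; this digital abstraction is an assumption made for illustration, since real riboswitch responses are graded.

```python
from itertools import product

# Idealized two-input gates: True means "input present" or "expression ON".
GATES = {
    "AND": lambda a, b: a and b,
    "NAND": lambda a, b: not (a and b),
    "NOR": lambda a, b: not (a or b),
}


def truth_table(gate):
    """Return {(input_a, input_b): output} for the named gate."""
    fn = GATES[gate]
    return {(a, b): bool(fn(a, b)) for a, b in product((False, True), repeat=2)}


if __name__ == "__main__":
    for name in GATES:
        print(name)
        for (a, b), out in truth_table(name).items():
            print(f"  A={int(a)} B={int(b)} -> {int(out)}")
```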


9.5 Conclusions 


As briefly summarized previously, synthetic riboswitches are attractive tools for 
interfacing bacterial synthetic circuits with small molecules of synthetic or natu- 
ral origins. In particular, the demonstrated versatility of RNA aptamers to recog- 
nize a wide variety of molecules is of practical importance although adapting in 
vitro selected aptamers to intracellular applications remains a technical chal- 
lenge. It is also noteworthy that complex functions that include molecular recog- 
nition, gene regulation, and, in some cases, multiple signal integration can all be 
encoded within one or a few short segments of RNA, whereas equivalent switches
and circuits based on protein transcription factors would require much more
genetic information.


Keywords with Definitions 


Untranslated region (UTR) The sequences within an mRNA that do not code 
for a protein 

Aptamer A nucleotide sequence (e.g., RNA) that is capable of binding specific 
target molecules such as small molecules and proteins 

Expression platform A sequence within a riboswitch that is responsible for 
ligand-mediated structural change resulting in gene regulation 

Riboswitch A stretch of RNA sequence mostly found in the 5’ UTR of bacterial 
mRNAs that binds a metabolite through an aptamer and regulates expression 
of the cis-gene 

Ribozyme An RNA sequence capable of catalyzing a chemical reaction such as
self-cleavage


Through control of messenger RNA stability, bacteria are able to process
information, respond to changing conditions, and maintain homeostasis. Many of the
naturally occurring mechanisms for transcript stability control (TSC) have been 
elucidated, and a number of studies have leveraged this understanding to dem- 
onstrate that transcript stability can be engineered to control static and dynamic 
gene expression. Collectively, that body of work represents a foundation for 
developing new forward-engineering approaches that harness mechanistic 
understanding to build predictive computational models to guide the develop- 
ment of large-scale genetic devices based on TSC and other means. Further 
increasing our understanding of RNA degradation pathways and mechanisms 
will also improve the ability to anticipate how undesired variations in transcript 
stability may confound device output goals and frustrate engineering efforts. 
Here, we discuss the current state of the art and identify routes for using TSC to 
design increasingly large and complex synthetic biological systems. 


10.1 An Introduction to Transcript Control


10.1.1 Why Consider Transcript Control? 


In naturally occurring biological systems, RNA-based genetic control mecha- 
nisms play crucial roles in regulating cellular functions. Genome-wide studies of 
bacterial transcript half-lives [1, 2] have underscored the importance that con- 
trol over transcript stability plays in enabling bacteria to process information, 
respond to changing cellular and environmental conditions, and, ultimately, 
maintain homeostasis. Bacterial transcripts are known to persist for times that 
vary in scale over orders of magnitude, from only a few seconds to an hour, or 
more. Nature uses several mechanisms to control transcript persistence [3-6], 
and experimental evidence has shown that these mechanisms can be engineered 
[7-10], providing a route to programming static and dynamic gene expression 
[11]. Developing technologies for designing variations in transcript stability 
could, therefore, increase the speed with which synthetic biological systems can
be created for applications in basic science; for the production of renewable 
chemicals, fuels, and materials for global health; and for the development of new 
therapeutic agents. Even when transcript stability is not an explicit aspect of a 
given genetic control device (e.g., ribosome binding site (RBS) control of transla- 
tion initiation), unknown or poorly characterized effects on transcript degrada- 
tion may affect genetic device outputs. It is therefore important to regard 
transcript stability through two lenses: as a “tuning knob” for predictably con- 
trolling gene expression dynamics and as a confounding factor if unaccounted 
for in genetic device design. 

In this chapter, we describe current understanding of transcript stability and 
processing for designing and engineering genetic expression devices with 
predictable functions. In Section 10.1, we consider the machinery that controls 
transcript stability within bacteria, with specific focus on Escherichia coli. 
Section 10.2 examines efforts to utilize this machinery for controlling gene 
expression dynamics. In Section 10.3, we consider ways of managing transcript 
stability to reduce unintentional and confounding effects. Section 10.4 details 
possible strategies for controlling transcript stability and points to future 
research directions in computation and wet-lab experimentation that may lead 
to design technologies for rapidly engineering genetic devices. The final section, 
Section 10.5, will provide a summary of the chapter. 


10.1.2 The RNA Degradation Process in E. coli


RNA is degraded through multistep pathways that can begin as soon as a 
transcript has been synthesized by an RNA polymerase. Degradation of mRNA 
typically begins with a rate-limiting, RNase E-mediated phosphodiester bond 
cleavage event. RNase E cleavage is followed by rounds of 3’→5’
degradation (E. coli has no known 5’→3’ exoribonuclease) [4], carried out in
concert with the degradosome, a collection of four enzymes, namely RNase E, RhlB,
polynucleotide phosphorylase (PNPase), and enolase [3, 12-14], that localizes
to the membrane [15, 16].

RNase E [17], an endoribonuclease and a rate-limiting cleavage enzyme, is 
thought to bind and process transcripts via two mechanisms (Figure 10.1), the 
first of which is 5’ entry at a monophosphorylated end. It was discovered to 
prefer substrates with unpaired 5’ ends in vivo [18], and early in vitro analysis of 
RNase E activity showed a manyfold reduction in cleavage rate when three dif- 
ferent RNAs were 5’ triphosphorylated (5’-PPP) instead of 5’ monophosphoryl- 
ated (5’-P) [19]. The structure of the RNase E domain was later shown to have a 
binding pocket that cannot accommodate substrates larger than a 5’-P [20], 
explaining the selectivity toward 5’-P- versus 5’-PPP-terminated transcripts and
the increased half-life of 5’ hydroxyl (5’-OH)-terminated transcripts [21]. As
transcripts are synthesized natively with 5’-PPP, it was hypothesized, and later
shown, that 5’-P RNAs are created in cells through the removal of the gamma
and beta phosphates from 5’-PPP RNAs [21]. This conversion, which creates the
direct substrates for RNase E cleavage, was found to be catalyzed by RppH, an 
RNA pyrophosphohydrolase [22]. 

Figure 10.1 Primary routes for RNase E-mediated mRNA degradation. “5’ entry” is initiated
when an mRNA undergoes 5’-PP removal, catalyzed by the pyrophosphohydrolase enzyme
RppH, creating a 5’-P that can be recognized and bound by RNase E (shown by “+” symbol).
“Direct entry” (at right) is 5’-independent entry by RNase E that occurs without recognition
and binding to a 5’-P moiety. Following RNase E binding, an initial cleavage event generates
3’-OH- and 5’-P-terminated RNAs that are efficient substrates for 3’→5’ degradation to
monomers and further rounds of RNase E binding and cleavage.


The second method of RNase E substrate recognition, often termed “direct
entry,” bypasses the 5’ end [23]. Experiments have demonstrated that mRNAs
containing a putative 5’ hairpin that inhibits 5’ RNase E binding are still degraded in an
RNase E-dependent manner [24], and the insertion of putative RNase E sites into
the coding region decreased the stability of RNA with a 5’ hairpin [25].
Additionally, several RNAs have been identified that can be rapidly degraded by
the RNase E catalytic domain even if they are not terminated with a 5’-P (i.e., if
they are 5’-PPP- or 5’-OH-terminated RNAs); this is thought to occur when RNase E
binds directly to unpaired regions in the coding sequence [26].

Once RNase E is associated with the RNA, either through 5’-P binding or 
direct entry, it can scan the transcript and catalyze cleavage at the initial target 
site [27, 28], setting in motion the recruitment of the other members of the 
degradosome and further degradation by 3’→5’ exoribonucleases and subsequent
rounds of RNase E activity [29]. PNPase, a 3’→5’ exoribonuclease, binds
to polyadenylated 3’ mRNA ends [30]. RhlB is an adenosine triphosphate (ATP)-dependent
helicase implicated in preparing RNA for RNase E and PNPase cleavage
by removing secondary structure. When inhibited, RhlB no longer enhanced
PNPase-mediated degradation in an ATP-dependent way [31], and when RhlB
was deleted, lacZ mRNA was stabilized in a ribosome-free context by impaired 
RNase E cleavage at the 5’ end [32]. Enolase is the least well-understood member 
of the degradosome and is thought to have a role in metabolism-related tran- 
script degradation [33]. 

Additional means of initiating degradation occur through RNase III cleavage 
and RNase G cleavage. RNase III is thought to primarily bind and cleave second- 
ary structures [34], often in the context of rRNA maturation and decay [35]. 
RNase G, an RNase E homolog, is usually involved in 9S rRNA maturation but in 
a small number of cases initiates mRNA decay as well [36]. 

While this is not an exhaustive account of the mechanisms related to RNA 
decay in E. coli, the aforementioned mechanisms are responsible for the majority 
of messenger RNA decay [3] and are the most salient for programming variations 
in gene expression levels. 


10.1.3 The Effects of Translation on Transcript Stability 


The development of a complete mechanistic understanding of RNA degradation 
has been complicated by the effects that ribosomes and translation have on tran- 
script stability (Figure 10.2a) [37]. For instance, ribosome binding has been found 
to attenuate RNase E cleavage in several studies. Incubation of the ompA mRNA 
with increasing molar excesses of 30S ribosomal subunits substantially reduced 
RNase E cleavage in the 5’ untranslated region (UTR) [38]. lacZ transcript half-life
was correlated with β-galactosidase enzyme activity (a proxy for translation
efficiency) when changes were made to the RBS [39], suggesting that ribosome 
occupancy positively influences transcript half-life. Taken together, these results 
support a simple steric hindrance model where the presence of ribosomes on an 
mRNA inhibits RNase E binding and cleavage [37]. 

Moreover, because translation is co-transcriptional in bacteria, the transcrip- 
tion rate can influence susceptibility to RNase E cleavage by determining the 
length of exposed transcript. If transcription outpaces the rate of ribosome bind- 
ing and translation initiation, much of the transcript, including potential RNase E 
binding sites, will be exposed. A study with the lacZ gene and mutant T7
bacteriophage polymerases in E. coli showed an inverse correlation between
β-galactosidase activity and the rate of T7 RNA transcription, a trend that was
RNase E dependent [40]. Experiments using premature stop codons to render
transcripts ribosome-less at known RNase E cleavage sites showed decreased
transcript stability [25, 41].

Figure 10.2 Naturally occurring transcript stability control mechanisms. (a) mRNAs that are
highly occupied by translating ribosomes will have occluded sites for RNase E 5’ entry
and direct entry, leading to relatively long transcript half-life and high levels of gene
expression. Exposed transcripts (i.e., with lower ribosome density), such as mRNAs transcribed
with bacteriophage polymerases with fast elongation rates, are more susceptible to RNase E
attack due to a lack of occluding ribosomes. (b) sRNA and asRNA operate through
Hfq-mediated binding to the RBS (green box) and/or start codon region of a target mRNA, which
prevents ribosome docking and likely recruits RNase E to the transcript. (c) The addition of
poly(A) tails to a transcript, usually by poly(A) polymerase (PAP I), creates a foothold for
binding by polynucleotide phosphorylase (PNPase), a 3’→5’ exoribonuclease.

The complex interplay of other transcript-related mechanisms that affect tran- 
script degradation is even less well understood. Ribosomal pausing can lead to 
cleavage by unknown ribonuclease activity. The subcellular localization of indi- 
vidual transcripts also affects ribosomal occupancy, which in turn affects RNA 
degradation [37]. 


10.1.4 Structural and Noncoding RNA-Mediated Transcript Control 


In most cases, RNase E must bind the transcript, either at the 5’ end or at a
single-stranded interior region, to initiate cleavage and begin degradation. This
implies that anything affecting RNase E binding will also affect transcript stabil- 
ity. Ribosomes, as explained before, are one such factor. RNA secondary struc- 
tures, or other stable base pairings that prevent access to the 5’ end or internal 
RNase E sites, are therefore expected to reduce RNase E binding and increase 
transcript stability (Figures 10.2b and 10.3a). 

Antisense oligos that base-pair near the 5’ end of a transcript were shown to 
lower the rate of RNase E cleavage [19], and studies of naturally long-lived mRNA 
in E. coli have pointed to 5’ hairpins as a means of precluding RNase E docking 
and conferring transcript stability [31, 32]. A naturally occurring riboswitch, or
functional RNA that changes conformation upon ligand binding to dynamically
sequester or present an RBS [42], has been discovered that also uncovers RNase E
cleavage sites when bound to its target metabolite, lysine. As a result, lysine
binding reduces the rate of translation initiation from the RBS and decreases
transcript half-life, further reducing protein expression [28]. More generally,
riboswitches are thought to decrease transcript stability in the RBS-sequestering
state by precluding ribosome binding [28] (also see Section 10.1.3).

Figure 10.3 Examples of engineered transcript stability control. (a) Synthetic secondary
structure hairpins within the 5’ UTR can increase half-life by preventing RNase E 5’ entry;
direct entry by RNase E is still possible. (b) 5’ ribozyme-mediated transcript cleavage creates a
5’-OH that is not recognized by RppH and therefore cannot become a 5’-P for RNase E to bind;
degradation of these processed transcripts occurs through RNase E direct entry.
(c) Riboregulators and riboswitches typically work by cis-RNA sequestration of the RBS, which
can be relieved by either trans-RNA binding to the cis-RNA or ligand binding to the cis-RNA.
These binding events free the RBS from the cis-RNA and therefore allow translation.

Like antisense oligos, small RNAs (sRNAs) can create base pairing in the 5’
UTR and thus limit binding by ribosomes or RNase E. sRNAs, an abundant form
of regulatory noncoding RNA (ncRNA) in bacteria [43], are typically tens to hundreds
of bases long. They usually function to enhance or repress ribosome binding
by base pairing, via a short (10-20-bp) seed region, with the target mRNA at
the Shine-Dalgarno (SD) and/or start codon regions [6, 44, 45]. In
many cases, sRNA target binding is mediated by the RNA chaperone activity of
the Hfq protein [46, 47]. Hfq deletion studies point to the importance of Hfq in
sRNA action [47-49] and stability, with increased sRNA degradation by PNPase
in stationary-phase Hfq-deficient strains [50]. Hfq has been found to coprecipitate with
nearly half of the known sRNAs of Salmonella [51] and with a quarter of the
then-known sRNAs in E. coli [52]. Interactions between some sRNAs and mRNAs can
occur in the absence of Hfq, however [48, 53].

sRNAs can enhance translation when binding to the mRNA eliminates cis-acting
secondary structure involving the SD sequence. In turn, ribosomes can
bind to the SD and initiate translation, protecting the mRNA from degradation
[54-57]. sRNAs can repress translation when binding to the mRNA occludes
ribosome binding [58, 59], rendering the transcript susceptible to endonuclease
cleavage due to a lack of protective ribosomes. In the latter case, sRNA binding
seems to recruit RNase E through interaction with Hfq and the 5’-P of the
sRNA [48], hastening degradation of both the sRNA and targeted mRNA. It is
interesting to note that sRNA-mediated RBS occlusion is sufficient for
downregulation; thus in some cases, degradation serves only to make the
downregulation irreversible [53, 60].


10.1.5 Polyadenylation and Transcript Stability 


Unlike in eukaryotes, polyadenylation of bacterial mRNA is not associated with 
transcript maturation and increased stability [5], but instead has been associated 
with mRNA destabilization (Figure 10.2c). Interestingly, although generally 
implicated in the degradation of nonfunctional or mutated RNAs as part of qual- 
ity control mechanisms [61], there are several examples where polyadenylation 
is employed to modulate gene expression [5, 62, 63]. The half-life of rpoS mRNA
in E. coli decreased when polyadenylated in the absence of RNase E, where
polyadenylation depended on pcnB [64], the gene coding for poly(A) polymerase
[65]. Poly(A) tails are used as footholds for exoribonucleases, such as PNPase,
that bind the poly(A) tails and perform 3’→5’ degradation [30, 62]. The half-lives
of three different mRNA transcripts increased when pcnB was knocked out,
coinciding with poly(A) tails shortened by up to 90% [66].


10.2 Synthetic Control of Transcript Stability 


10.2.1 Transcript Stability Control as a “Tuning Knob” 


As outlined in Section 10.1, transcript stability is determined through the collec- 
tive impact of a multitude of sequence and structural features. The 5’ terminus 
identity (i.e., 5’-PPP vs 5’-P vs 5’-OH) and the presence of stable secondary
structures within the 5’ UTR affect 5’ end accessibility by RppH and RNase E. 
Active translation creates steric hindrance and ribosome occlusion that reduces 
internal accessibility by RNase E. Finally, 3’ end accessibility by PNPase varies 
according to 3’ UTR secondary structure, polyadenylation state, and the pres- 
ence or absence of sRNAs that mediate degradation. Because RNAs can be tran- 
scribed and degraded within the space of only a few minutes, variations in 
transcript stability can have dramatic effects on RNA levels. This implies that 
gene expression can be controlled quickly and dynamically by modulating the 
sequence and structural features that directly affect transcript stability. In natu- 
rally occurring systems, swings in transcript abundance allow cells to respond to 
changing conditions and, for instance, reestablish perturbed homeostasis or 
respond to the buildup of intra- or extracellular toxins [2]. TSC thus presents a 
powerful platform for meeting system design goals for applications, such as metabolic
pathway engineering or biosensing, that require the ability to generate
specific levels of static gene expression or dynamic genetic outputs that change
as a function of a targeted molecule. Many of the naturally occurring mechanisms
can be tuned to program static levels of gene expression [67], and dynamic
control [11] is possible if these static mechanisms are regulated by the binding 
activities of functional RNA structures evolved with in vitro selection to bind 
specific metabolites (e.g., RNA aptamers, or RNA aptamer-regulated ribozymes, 
aptazymes) [7]. 
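The link between half-life and how quickly transcript levels can change can be illustrated with a minimal kinetic sketch. It assumes constant synthesis and first-order decay, dm/dt = synthesis_rate - k*m, and ignores dilution by cell growth; the function and parameter names are introduced here for illustration and do not come from the cited work.

```python
import math


def decay_constant(half_life_min):
    """First-order decay constant (per minute) for a given half-life."""
    return math.log(2) / half_life_min


def mrna_level(t_min, synthesis_rate, half_life_min, m0=0.0):
    """Transcript level at time t for dm/dt = synthesis_rate - k * m.

    Closed-form solution: m(t) = m_ss + (m0 - m_ss) * exp(-k * t),
    where m_ss = synthesis_rate / k is the steady-state level.
    """
    k = decay_constant(half_life_min)
    m_ss = synthesis_rate / k
    return m_ss + (m0 - m_ss) * math.exp(-k * t_min)


def time_to_fraction(fraction, half_life_min):
    """Time needed to cover a given fraction of the way to a new steady state."""
    return -math.log(1.0 - fraction) / decay_constant(half_life_min)


if __name__ == "__main__":
    for half_life in (2.0, 20.0):  # illustrative half-lives in minutes
        t90 = time_to_fraction(0.9, half_life)
        print(f"half-life {half_life:>4.0f} min: 90% response after {t90:.1f} min")
```

In this simple model the response time depends only on the decay constant, so a short-lived transcript approaches a new level faster than a long-lived one, while the long-lived transcript accumulates to a higher steady state at the same synthesis rate.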

Several TSC mechanisms have been used over the past 15 years in synthetic
genetic systems. A small number of TSC mechanisms, namely, 5’ and 3’ UTR
hairpins [67-72], 5’ UTR cleavage [7], and antisense RNA (asRNA)/sRNA binding
[10, 53, 73-76], have been used to explicitly control transcript stability. (The
systems developed using these mechanisms are discussed in detail in the following
subsections.) Other systems were not explicit attempts to alter transcript stability
[9, 10, 77, 78]. Rather, their changes to ribosome binding and UTR structure
likely altered RNA degradation, even though altered stability was not
the chief actuator of control. Nevertheless, this substantial body of work has
significantly advanced knowledge of RNA engineering that will undoubtedly be 
important in creating novel genetic control systems based on tuning mRNA 
stability. Moreover, this work has helped identify RNA components that are most 
easily engineered and understood and has reinforced the many strengths of 
RNA-based technologies, namely, low host metabolic burden [75], inherent 
orthogonality [76], and the evolvability [79, 80] of new components. With this 
work and advancing knowledge of degradation processes, transcript stability 
is poised to become a powerful means of genetic control, either on its own or as 
part of a larger control scheme. 

Moving forward with increasing understanding of RNA device design princi- 
ples and mechanistic understanding of degradation processes, it should be pos- 
sible to formulate model-driven frameworks based on TSC mechanisms. Casting 
biochemical, mechanistic understanding of transcript degradation in terms of 
measurable and tunable design variables will enable us to take advantage of com- 
putational techniques to increase the speed of design, predictability, and scale of 
synthetic biological systems [7]. 


10.2.2 Secondary Structure at the 5’ and 3’ Ends 


The earliest attempts to engineer the stability of transcripts in bacteria involved
hindering ribonuclease entry by adding stable hairpin secondary structures
to the 5’ end of transcripts (Figure 10.3a) [67-69, 81, 134] or to the 3’ end [70, 71],
followed by hairpins at both termini [72]. When a hairpin from the T7 gene 10
leader sequence was added to the 5’ end of lacZ in E. coli, β-galactosidase activity
increased threefold, but only when RNase E was present, suggesting that the
hairpin increased transcript half-life by reducing RNase E binding and cleavage
rates [69]. A similar experiment also saw a threefold improvement in half-life
after a 5’ hairpin addition [67]. Carrier and Keasling built a small library of 5’
hairpins that conferred an order-of-magnitude range in half-life, from 2 min up
to almost 20 min [81]. Similarly, 3’ hairpin introduction has been shown to
inhibit degradation [70] and increased the penicillinase (penP) transcript half-life
threefold in both E. coli and Bacillus subtilis [71].
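As a rough reading of these half-life ranges, one can assume a constant transcription rate and first-order decay (the symbols α, k_deg, and m_ss are introduced here for illustration, and dilution by growth is ignored). Under those assumptions the steady-state transcript level scales linearly with half-life:

```latex
\frac{dm}{dt} = \alpha - k_{\mathrm{deg}}\, m ,
\qquad
m_{\mathrm{ss}} = \frac{\alpha}{k_{\mathrm{deg}}} = \frac{\alpha\, t_{1/2}}{\ln 2}
\quad\Longrightarrow\quad
\frac{m_{\mathrm{ss}}\,(t_{1/2} = 20\ \mathrm{min})}{m_{\mathrm{ss}}\,(t_{1/2} = 2\ \mathrm{min})}
= \frac{20}{2} = 10 .
```

Under this idealization, a hairpin library spanning half-lives from 2 min to almost 20 min would span up to roughly a tenfold range in steady-state transcript level at a fixed transcription rate.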

These results demonstrated that transcript secondary structures inhibiting 
RNase E and exoribonuclease binding are useful tools for varying mRNA half-life 
in a static manner. Despite these successes, however, it has been difficult to con- 
trol transcript stability in a quantitatively predictable manner through secondary 
structure engineering [72, 82]. In principle, it should be possible to develop more 
explicit design rules if the relationships between secondary structure folding 
kinetics, stability, and RNase E binding occlusion can be further developed. 
Cambray et al. combined experiment with kinetic RNA folding simulation analy- 
sis of a large number of transcriptional terminators to derive heuristics for relat- 
ing sequence and structural features to termination efficiency [83]. Similarly, by 
testing the effect of different hairpin structures on mRNA stabilities in multiple 
transcript contexts, it may be possible to identify rules to understand how 
the sequence and structure of a given hairpin affects half-life. Furthermore, as 
RNase III [34] or helicase activity [32] may mitigate the stabilizing effects of sec- 
ondary structure, more study of RNA sequence and structure interactions with 
these enzymes should lead to better genetic design predictability. 


10.2.3 Noncoding RNA-Mediated


ncRNA has been used for TSC in two related forms, namely, sRNA and asRNA. 
Both sRNA and asRNA act via an antisense mechanism and base-pair with a
region, usually the 5’ UTR, of a target mRNA (Figure 10.2b). In one well-known
example, asRNA was derived from the RNA-IN/RNA-OUT system from the 
insertion sequence IS10 in E. coli [84]. There, the RNA-OUT antisense hairpin
binds to the RNA-IN portion of the target mRNA. Although engineered sRNA
mechanisms have originated from distinct naturally occurring ncRNA systems, 
both asRNA [85] and sRNA target [74, 86] base pairing has been shown to be 
Hfq-mediated in vitro, so it is likely that asRNA and sRNA functions are mecha- 
nistically similar. 

Substantial progress has been made in the past few years toward developing 
sRNA and noncoding trans-RNA as avenues for controlling gene expression. In 
2011, Man et al. developed initial design principles for creating novel sRNAs 
with Hfq-binding sites and regions targeting enhanced green fluorescent protein 
(EGFP) and a native E. coli gene [10]. They tested 16 such sRNAs and reported 
relative expression knockdown levels ranging from 6% to 71%. Furthermore, they 
showed sRNA-dependent target mRNA half-life reduction and used a tempera- 
ture-sensitive RNase E mutant to establish the RNase E dependence of target 
transcript level reduction. Surprisingly, the reduction in gene expression was 
unaffected by the presence or absence of RNase E, suggesting that sRNA binding 
alone was sufficient to reduce translation and that TSC does not play a dominant 
role in this system (see also [60]). 

Sharma et al. randomized the antisense seed portion of the E. coli sRNA
Spot42 [73] to screen for sRNAs that downregulate a natively targeted gene.
After a single round of screening, they identified sRNAs that repressed this
gene 45-145-fold, an improvement over the
27-fold repression of the native sRNA. After two rounds of screening
against a gene with no native sRNA, they identified sRNAs that could repress the
relative level of gene expression 23-85-fold.

Most recently, Ishikawa et al. [74], Park et al. [53], and Na et al. [75] have taken 
systematic approaches to uncover design principles for sRNAs that are effective 
at repressing gene expression. Ishikawa et al. studied the SgrS sRNA in E. coli 
using mutational analysis and Northern blotting to elucidate the Hfq-binding 
motif of that sRNA. This motif was incorporated into artificial sRNA against 
three mRNA targets that showed orthogonal Hfq-dependent knockdown via 
Northern blotting. The authors speculate that any mRNA can be effectively 
targeted by designing sRNAs with at least 14 nucleotides (nt) of sequence 
complementarity to the RBS and a cis Hfq-binding motif located within 10 nt. Na 
et al. screened native sRNA scaffolds and potential mRNA target binding sites 
around the SD region and found that a MicC sRNA scaffold with a binding site 
spanning the first 21 nt (not including the SD region) of the target mRNA was 
particularly effective. Using that insight, sRNAs were developed to target native 
genes in a microbial platform engineered to produce L-tyrosine and cadaverine. 
In both cases, the authors were able to employ sRNA-mediated genetic repres- 
sion to divert metabolic flux and increase product formation in the engineered 
system. Collectively, these sRNA design studies suggest that an Hfq-binding, 
scaffold-based sRNA platform may provide a means of downregulating gene 
expression predictably, as binding energy of the antisense region is strongly 
correlated with repression capacity [75]. 

In addition to sRNA, at least two studies have utilized IS10-based asRNA 
against the 5’ UTR. The first study built a model from 529 possible combina- 
tions of 23 sense and antisense pairs (termed RNA-IN and RNA-OUT), which 
was then used to forward-engineer new regulators [76]. A second study built 
upon the RNA-IN and RNA-OUT system by adding a theophylline aptamer-based
domain upstream of the RNA-OUT asRNA, which functions similarly to a
riboswitch in that gene expression is controlled through structures modulating
ribosomal access to the RBS. Several designed mutants were screened to
find an aptamer-RNA-OUT pseudoknot interaction that impaired the RNA-OUT
asRNA’s ability to bind its RNA-IN partner when the aptamer domain was
not bound to its ligand [87]. This provides another means of dynamic control
and, given the similarities between asRNA and sRNA, indicates that sRNA 
could be engineered for dynamic control by appending ligand-binding aptamer 
domains. 
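The figure of 529 combinations follows from pairing each of 23 sense variants with each of 23 antisense variants (23 × 23). The short sketch below enumerates such a cross-interaction matrix; the variant labels and the convention that equal indices mark cognate pairs are hypothetical, used here only to show the bookkeeping.

```python
from itertools import product

N_VARIANTS = 23  # number of sense/antisense variants stated in the text

# Hypothetical labels; the actual variant names are not given in this chapter.
sense = [f"RNA-IN-{i}" for i in range(1, N_VARIANTS + 1)]
antisense = [f"RNA-OUT-{i}" for i in range(1, N_VARIANTS + 1)]

# Every sense variant tested against every antisense variant.
pairings = list(product(sense, antisense))

# Assumed convention: matching indices denote cognate (designed) pairs.
cognate = [(s, a) for s, a in pairings if s.split("-")[-1] == a.split("-")[-1]]
non_cognate_count = len(pairings) - len(cognate)

if __name__ == "__main__":
    print(len(pairings), len(cognate), non_cognate_count)
```

Most entries in such a matrix are non-cognate pairings, which is what makes it useful for assessing orthogonality.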


10.2.4 Model-Driven Transcript Stability Control for Metabolic 
Pathway Engineering 


Ribozyme-catalyzed phosphodiester bond cleavage can affect mRNA half-life in 
primarily two ways (Figure 10.3b). Depending on the sequence context, and 
whether the target site is within a 5’ or 3’ UTR, cleavage may remove or alter 
secondary structures that influence RNase E or ribosome docking, resulting 
in differences in transcript stability, and, potentially, levels of gene expression 
[21, 78, 88]. Phosphodiester bond cleavage within the 5’ UTR also removes the
5’-P(PP) recognized by RppH or RNase E, which can lead to increased transcript 
persistence and gene expression [7, 21, 88] (see Section 10.1.2). We speculate 
that mRNA terminated with a 5’-OH is degraded via comparatively slow RNase 
E direct entry, consistent with increased half-lives measured for transcripts 
cleaved by hammerhead ribozymes [21]. 

Carothers et al. [7] formulated a model-driven process that uses UTR cleavage 
to engineer devices that regulate transcript stability and quantitatively program 
gene expression. Static ribozyme-regulated expression devices (rREDs) and 
dynamic, metabolite-controlled aptazyme-regulated expression devices (aREDs) 
were constructed that employ transcript stability, via 5’ UTR cleavage, as the 
underlying genetic control mechanism. With mechanistic understanding of RNA 
degradation pathways as a starting point [21, 88], a coarse-grained biochemical 
model of device function was created to simulate global device functions from 
local, measurable, and tunable component characteristics. The combinatorial 
space of design variable inputs was then mapped to the space of device outputs 
with a sampling-based approach, providing data for global sensitivity analysis 
(GSA) and identifying functional designs that meet targeted performance crite- 
ria. To physically implement functional devices, a novel method for designing 
transcripts with kinetic RNA folding simulations [89] was created that enables 
the assembly of individually characterized component parts.
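
The mapping from design variables to device outputs described above can be illustrated with a minimal sketch. It is not the model of Carothers et al. [7]; it assumes a hypothetical two-state scheme in which an uncleaved transcript is either cleaved by the ribozyme or degraded, the cleaved transcript decays at its own rate, and both forms are translated equally. All parameter names (`k_tx`, `k_clv`, `d_u`, `d_c`, `k_tl`, `d_p`), their sampling ranges, and the tolerance are assumptions chosen only for illustration.

```python
import math
import random


def steady_state_protein(k_tx, k_clv, d_u, d_c, k_tl, d_p):
    """Steady-state protein level for a two-state (uncleaved/cleaved) transcript model."""
    uncleaved = k_tx / (k_clv + d_u)           # made by transcription, lost to cleavage or decay
    cleaved = k_clv * uncleaved / d_c          # made by cleavage, lost to decay
    return k_tl * (uncleaved + cleaved) / d_p  # both forms assumed to be translated equally


def log_uniform(rng, low, high):
    """Draw one value uniformly in log10 space between low and high."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


def screen_designs(n, target, tol=0.2, seed=0):
    """Sample n hypothetical designs and keep those within tol (fractional) of the target output."""
    rng = random.Random(seed)
    hits = []
    for _ in range(n):
        design = {
            "k_tx": log_uniform(rng, 1e-2, 1.0),   # transcription rate (assumed range)
            "k_clv": log_uniform(rng, 1e-3, 1.0),  # ribozyme cleavage rate (assumed range)
            "d_c": log_uniform(rng, 1e-3, 1e-2),   # decay rate of the cleaved transcript (assumed range)
        }
        output = steady_state_protein(d_u=1e-2, k_tl=0.1, d_p=5e-4, **design)
        if abs(output - target) <= tol * target:
            hits.append((design, output))
    return hits
```

Calling `screen_designs(2000, target=5000.0)` returns the sampled parameter sets whose simulated output falls within 20% of the target; in this toy model, faster cleavage raises the output whenever the cleaved transcript is more stable than the uncleaved one.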

To demonstrate that variations in tunable design parameters generate quanti- 
tatively predictable outputs, genetic devices were constructed to program 
amounts of a reporter protein and production levels of p-aminophenylalanine 
(p-AF), a chemical precursor of bioactive compounds and advanced polymers, 
from a 12-gene engineered biosynthetic pathway. In total, 28 E. coli expression 
devices were assembled from component parts that were generated and charac- 
terized separately in vitro, in vivo, and in silico. Excellent quantitative agree- 
ment between the design specifications and the device functions (r=0.94) was 
observed, experimentally validating the underlying models and simulation tools 
and the overall approach. rREDs and aREDs have immediate utility as program- 
mable biosensors and controllers for metabolic pathways and genetic circuits. 
Notably, this work also provides a conceptual and experimental framework
for investigating and engineering complex RNA functions through the applica- 
tion of fundamental biochemical understanding. 

Using this framework, we envision a model-driven design process for creating 
RNA-based dynamic control systems for applications in metabolic engineering 
and biosensing (Figure 10.4). As a testbed for RNA-based control circuit design, 
we are engineering E. coli to produce p-aminostyrene (p-AS), a component of 
polymer composites with optical and mechanical properties favorable for 
advanced applications in photonics, photolithography, and biomedicine [90, 91]. 
Substituted styrenes have been difficult to chemically synthesize in high yields 
[92], and the cytotoxicity of key intermediates and products has prevented effi- 
cient microbial production [93]. The proposed p-AS pathway is an ideal testbed 
because it has 15 well-defined gene products and measurable intermediates yet 
presents a full complement of canonical control problems that must be addressed 
to obtain efficient production. 

Figure 10.4 Model-driven design workflow for engineering gene expression with TSC. A basic
systems-level model is used to identify goals for genetic device outputs. These goals inform 
the creation of a mechanistic model based on biochemical understanding, which is used to 
identify component specifications needed for device function. Components are then 
engineered and/or evolved with in vitro selection to meet the design specifications. Transcript 
design methods employing biophysical models of RNA folding are employed to enable the 
assembly of individual RNA components into functional devices. The mechanistic model is 
then refined to account for engineered component characteristics used to predict device 
outputs. Systems-level functions are obtained through the assembly of multiple static and 
dynamic RNA devices. 


Drawing on the design principles gleaned from naturally occurring metabolic 
control circuits [94, 95], one approach to optimizing p-AS production would be 
to implement dynamic controllers that operate as a function of flux through 
p-AF, a cell-permeable intermediate. Circuits composed of static rREDs and
dynamic p-AF-responsive aREDs could be constructed to program flux through 
the pathway. In principle, there are many possible control topologies and corre- 
sponding RNA-based feedback architectures that could be implemented to 
enable high levels of p-AS production. An important aspect of this work will 
therefore be to identify and experimentally validate the feedback architectures 
that can be implemented across the tunable biochemical parameter ranges. 

Finally, results showing the importance of robust folding to the design of 
functional rREDs and aREDs are consistent with the idea that kinetically driven 
co-transcriptional folding pathways significantly impact cellular RNAs [96].
Improving the ability to integrate biochemical models and refined RNA 
transcript folding design algorithms should therefore lead to better tools for 
engineering genetic control systems that employ RNA sequence and structure 
design to quantitatively program expression. 


10.3 Managing Transcript Stability


10.3.1 Transcript Stability as a Confounding Factor


Perhaps the greatest obstacle on the road to predictable biological engineering is 
the joint confounding effect of cellular subsystems that interact with synthetic 
biological components in unanticipated ways. In this regard, it is important to 
realize that any and all synthetic RNA in the cell is affected by the degradation
systems regulating transcript stability. Except in cases where transcript stability 
is the explicit genetic control mechanism, efforts aimed at engineering gene 
expression tend to neglect dimensions of transcript stability. However, as new 
tools are developed, it should become much easier to circumvent limitations 
imposed by variations in RNA stability and instead fine-tune transcript stabil- 
ity in concert with other engineering strategies to rapidly implement genetic 
controls to meet performance requirements. 


10.3.2 Anticipating Transcript Stability Issues 


Because so many factors can affect RNA stability, it is important to consider the 
ways that experimental results may be impacted by unexpected changes in tran- 
script stability. Moreover, it may be prudent to routinely determine whether 
transcript stability is a parameter requiring attention, either through computa- 
tional design variable sensitivity analysis or through wet-lab experimentation. 
The roles and binding behaviors of all ribonucleases and associated proteins have 
yet to be elucidated [97], but study of major players such as RNase E, PNPase, 
RppH, Hfq, and RNase III has unveiled structures and many key roles of these 
enzymes. Though the binding interactions of these enzymes with RNA and each 
other are not completely understood, it is possible to analyze sequences and 
attempt to avoid unwanted degradation, or increase degradation, by changing 
codons within the open reading frame or UTR sequences to eliminate, or insert, 
putative binding sites. 

Computationally, the potential impact of transcript stability on a given syn- 
thetic biological device output can be assessed with GSA using coarse-grained 
mechanistic model simulations and Monte Carlo sampling [98]. With this 
method, the global space of potential designs is mapped by simulating genetic 
device outputs with Monte Carlo sampled values for the model parameters taken 
randomly from biochemically reasonable ranges. By computing quantitative 
GSA measures to relate the potential genetic device outputs to transcript stabil- 
ity parameter inputs (e.g., partial correlation coefficients) [98, 99], the impact of 
variations in RNA degradation rate, relative to other tunable design variables, 
can be readily discerned [7]. 
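
A minimal sketch of this kind of analysis is given below. It assumes a deliberately simple one-species steady-state model (protein proportional to `k_tx * k_tl / (d_m * d_p)`), uniform sampling ranges that are hypothetical, and partial rank correlation coefficients computed from the inverse of the rank correlation matrix. None of these choices is taken from the cited work [7, 98, 99]; they only illustrate how the influence of the mRNA degradation rate `d_m` can be compared with that of other tunable inputs.

```python
import math
import random


def protein_output(k_tx, k_tl, d_m, d_p=5e-4):
    """Steady-state protein from a one-species mRNA model (illustrative only)."""
    return k_tx * k_tl / (d_m * d_p)


def ranks(values):
    """Rank-transform a list (0 = smallest); ties are not expected for continuous samples."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = float(rank)
    return r


def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def prcc(inputs, output):
    """Partial rank correlation of each named input with the output."""
    names = list(inputs)
    cols = [ranks(inputs[k]) for k in names] + [ranks(output)]
    prec = invert(correlation_matrix(cols))
    y = len(cols) - 1
    return {k: -prec[i][y] / math.sqrt(prec[i][i] * prec[y][y])
            for i, k in enumerate(names)}


def run_gsa(n=2000, seed=1):
    """Monte Carlo sample the inputs, simulate outputs, and return PRCC values."""
    rng = random.Random(seed)
    # Hypothetical sampling ranges, chosen only for illustration.
    ranges = {"k_tx": (0.01, 0.1), "k_tl": (0.05, 0.1), "d_m": (0.001, 0.02)}
    inputs = {k: [rng.uniform(lo, hi) for _ in range(n)] for k, (lo, hi) in ranges.items()}
    output = [protein_output(a, b, c)
              for a, b, c in zip(inputs["k_tx"], inputs["k_tl"], inputs["d_m"])]
    return prcc(inputs, output)
```

In this sketch, a strongly negative coefficient for `d_m` relative to the other inputs would indicate that transcript stability deserves attention for the design in question.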

Considering transcript stability, and how it may impact system function, is
especially important when introducing secondary structure into a transcript, as 
this may cause unwanted RNase binding or result in premature transcription 
termination. See Section 10.3.4 for an example detailing issues arising from add- 
ing cis-repressor riboregulator RNA into a 5’ UTR in E. coli. Another example 
comes from efforts to obtain detectable signals in vivo from RNA aptamer-based 
fluorescent biosensors (i.e., “Spinach” aptamer conjugates). To do this, Paige 
et al. had to employ an RNase E-deficient E. coli strain [100] to circumvent limi- 
tations likely resulting from an otherwise short aptamer half-life. 

There are other potentially confounding effects that are more difficult to 
account for, but that should still be considered in the course of genetic device 
engineering. Large amounts of synthetic mRNA and/or regulatory RNA from 
complex circuit designs could lead to overloading the degradosome or associated 
enzymes such as Hfq, causing cell-wide RNA stability changes. A phenomenon 
of this sort is difficult to study, but the work of Hussein and Lim on competition 
for Hfq [49] suggests it is worth attempting to understand. They found that sRNA 
expressed without a target binding partner reduced sRNA effectiveness cell-wide 
by binding Hfq and limiting its accessibility. Expression of the target mRNA 
removed this problem, suggesting that balance in expressing synthetic sRNA can 
be critical. 


10.3.3 Uniformity of 5’ and 3’ Ends 


Variations in UTR sequence context may elicit differences in local secondary 
structure, which in turn may alter transcript stability and gene expression 
[78, 88]. One way to guard against such context-dependent transcript stability 
problems is to attenuate UTR variability effects by removing 5’ and/or 3’ RNA 
that may form undesired secondary structure (Figure 10.3b). Several studies 
have utilized removal of 5’ UTR secondary structures as a mechanism for 
minimizing context-dependent differences in gene expression. One involved a 
screen of “insulator” sequences and structures placed within the 5’ UTR. A 
ribozyme-hairpin combination, termed RiboJ, produced nearly identical 
transfer functions for two different genes under the control of three different 
promoters (several other ribozymes had similar effects) [78]. A second study 
used the Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) 
processing system from Pseudomonas aeruginosa strain UCBPP-PA14 to 
remove both 5’ and 3’ UTR sequences. At both ends, a 28-nucleotide repetitive 
sequence, recognized by the Csy4 endonuclease, was added, which resulted in 
efficient transcript cleavage and UTR sequence removal. Using the CRISPR 
system, they were able to show similar levels of protein production in the 
context of different promoter and RBS combinations in mono- and bicistronic 
systems, with green fluorescent protein (GFP) and red fluorescent protein 
(RFP) outputs [88]. 

Mutalik et al. recently published a scheme for minimizing 5’ UTR-induced
variations in gene expression that involves introducing a standby RBS in a
bicistronic design (BCD) [101]. The standby RBS is designed to cause ribosome
binding upstream of the real RBS (i.e., the RBS from which translation of the
desired open reading frame predominantly initiates), disrupting secondary structure
that may cause variability in ribosome binding to the real RBS. The use of a BCD 
resulted in a large improvement of transfer function correlations between multi- 
ple gene contexts. 

With the three individual systems in this section as starting points, it may be 
possible to combine them into a general-use template for attenuating 5’ UTR- 
and 3’ UTR-induced variations in gene expression. Further reductions in coding 
sequence-induced variability could come from work that identifies sequences 
and structures that, through targeted codon changes, minimize RNase E direct 
entry and RNase III binding. 


10.3.4 RBS Sequestration by Riboregulators and Riboswitches 


As mentioned in the Introduction, RBS sequestration can hasten mRNA degra- 
dation by leaving the transcript open to binding from RNases. Thus, riboregula- 
tors, riboswitches, and similar structures that sequester the RBS can be used to 
dynamically control mRNA stability, at least in part, via induction (or repression) 
by a trans-activating RNA or small molecule (Figure 10.3c). Riboregulators are 
functional RNA structures with two components: a cis-repressor and a trans- 
activator. The cis-repressor is generally 5’ UTR RNA that folds into a conforma- 
tion that base-pairs with the RBS, making it unavailable to ribosomes. This 
repression can be relieved by the trans-activator RNA, an RNA that interacts 
with the cis-repressor RNA so that the RBS is revealed for translation [54]. The 
mechanistic design can also be reversed, whereby trans-RNA binding changes 
the cis-RNA conformation to sequester the RBS [54]. Like riboregulators, ribos- 
witches can function to bind the RBS in cis and prevent or enhance translation 
by hindering, or allowing, ribosome docking. In the place of trans-RNA, ribos- 
witches control RBS sequestration with conformational changes mediated by 
small molecule ligand binding. Functional synthetic riboswitches have been 
developed to bind a variety of ligands [102-105]. 

As both riboregulators and riboswitches involve adding secondary structure 
into the 5’ UTR, their presence is likely to cause altered transcript stability due to 
changes in RNase binding site accessibility. It may prove useful to think of these 
systems in terms of their dual effects on translation rate and transcript degrada- 
tion rate, which will require gathering data related to half-life, and not just final 
gene expression and protein output. Both sets of data would be necessary to 
decouple the contributions of translation rate changes from transcript stability 
changes. 
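
The decoupling described above can be made concrete with a small sketch. It assumes that steady-state protein is proportional to translation rate multiplied by steady-state mRNA, that steady-state mRNA is proportional to half-life, and that transcription rate and protein decay are unchanged between the two constructs being compared. Under those assumptions, which are not taken from the cited studies, the protein fold change factors into a translation term and a stability term. The function name and inputs are hypothetical.

```python
import math


def decompose_fold_change(protein_fc, half_life_fc):
    """Split a protein fold change into translation and stability contributions.

    protein_fc   -- protein output with the 5' UTR structure / output without it
    half_life_fc -- mRNA half-life with the structure / half-life without it
    """
    if protein_fc <= 0 or half_life_fc <= 0:
        raise ValueError("fold changes must be positive")
    translation_fc = protein_fc / half_life_fc
    return {
        "translation_fc": translation_fc,
        "log2_translation": math.log2(translation_fc),
        "log2_stability": math.log2(half_life_fc),
    }
```

For example, with illustrative numbers only, a 10-fold drop in protein (`protein_fc=0.1`) accompanied by a halved half-life (`half_life_fc=0.5`) would attribute a 5-fold reduction to translation and a 2-fold reduction to stability.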

Efforts by the Collins lab to engineer synthetic riboregulators show how tran- 
script stability changes can impact device outputs. The addition of cis-repressor 
RNA in the 5’ UTR of a particular GFP expression cassette significantly reduced 
protein expression, and repression was largely alleviated by the trans-activator 
RNA [9]. The synthetic riboregulator system was subsequently expanded to 
develop a microbial kill switch [106] and a genetic switchboard to regulate four 
carbon-utilization genes [107]. Though this system has been utilized success- 
fully, it is worth noting that inserting cis-repressor RNA into their constructs led 
to a 40% reduction in mRNA levels versus no cis-repressor RNA [9], which the 
authors attribute to either premature transcript termination due to the
cis-repressor secondary structure or the activities of RNases, such as RNase III, that
cleave double-stranded RNA. 

Elsewhere, the space of trans-activating RNA targeting a fixed cis-repressor 
RNA, regulating GFP expression in E. coli, has been computationally explored 
[108]. In the cis-repressed state, where the level of genetic output was equivalent 
to 1-4% of the unrepressed state, activation by one of six designed trans-activat- 
ing RNAs increased GFP production 3-11-fold relative to the baseline. 
Quantitative reverse transcription polymerase chain reaction (RT-PCR) showed 
that the ratio of trans-activating RNA to mRNA did not change in an RNase III
knockout strain compared with the wild type. However, the relative genetic out- 
put induced by trans-activating RNA more than doubled in the RNase III knock- 
out, which, as mentioned, is consistent with the idea that variations in transcript 
stability can alter the performance characteristics of these kinds of control 
devices and systems. 

A riboregulator-like RNA, called an allosteric ribozyme, previously only char- 
acterized in vitro [109-111], has recently been used to control translation initia- 
tion in vivo [112]. A ribozyme, with the RBS sequestered in its secondary 
structure, was designed to autocatalytically cleave itself to expose the RBS and 
allow translation initiation. Trans-activating RNAs were designed to bind a com- 
plementary sequence within the ribozyme, inhibiting ribozyme cleavage and 
exposure of the RBS, leading to 10-fold reductions in relative EGFP expression. 
The question of how variations in transcript stability might be in play here has 
not been directly investigated. 

Overall, the work described here highlights promising approaches for engi- 
neering dynamic RNA-based control systems. It is also clear that, to improve 
engineering tractability, there is a need to investigate how introducing secondary 
structure (whether a cis-repressor RNA, riboswitch, or ribozyme) into the
5’ UTR may lead to confounding effects on device outputs stemming from unac- 
counted-for RNase binding or premature transcription termination. 


10.3.5 Experimentally Probing Transcript Stability 


The determinants of synthetic transcript stability can be analyzed by experimen- 
tally measuring mRNA half-life, through expression studies, by the use of endo- 
nuclease gene knockout strains, and with computational RNA folding simulations. 
Quantitative gel electrophoresis [21], quantitative PCR, or RNA-seq [1] after the 
addition of a transcription-inhibiting antibiotic (e.g., rifampicin) can be done at 
intervals to determine average transcript half-life by quantifying transcript abun- 
dance as a function of time. A strategy using sRNA to quantify mRNA abun- 
dance changes has also been proposed [113]. Comparing measurements from 
cells with and without an RNase (via knockout) can lend insight into the RNase 
dependence of a phenomenon, though care should be taken to understand the 
global impact of an RNase deletion and how that may complicate data interpreta- 
tion. Folding simulation tools for calculating minimum free energy (MFE) sec- 
ondary structures [114-117] or kinetically driven co-transcriptional [118-120] 
folding trajectories can lend insight into whether secondary structure could be 
problematic (see Section 10.4.1 for more information). As more researchers use
these tools to better understand synthetic RNA function, the aggregated data 
will lead to further insights and design strategies that can take advantage of sta- 
bility or mitigate unwanted effects. 
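
As a minimal sketch of the time-course analysis described above, the half-life can be estimated by fitting a straight line to the logarithm of transcript abundance against time after transcription is halted. The sketch assumes simple first-order decay, strictly positive abundance measurements, and ordinary least squares; the function name and the choice of fitting method are assumptions rather than the procedure used in the cited studies.

```python
import math


def estimate_half_life(times, abundances):
    """Estimate mRNA half-life from a decay time course by log-linear least squares.

    times      -- sampling times after transcription inhibition (any consistent unit)
    abundances -- transcript abundance at each time (must be positive)
    """
    if len(times) != len(abundances) or len(times) < 2:
        raise ValueError("need at least two paired measurements")
    logs = [math.log(a) for a in abundances]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, logs))
    slope = sxy / sxx  # equals minus the first-order decay constant
    if slope >= 0:
        raise ValueError("abundance does not decrease over time")
    return math.log(2) / -slope
```

The returned half-life is in the same unit as `times`. Comparing estimates from wild-type and RNase knockout strains, with the caveats noted above, is one way to use such a fit.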


10.4 Potential Mechanisms for Transcript Control 


10.4.1 Leveraging New Tools 


The advent of recent technologies, such as high-throughput RNA secondary 
structure elucidation, high-throughput RNA sequencing, and co-transcriptional 
RNA folding simulations, provides new ways to investigate transcript control for 
predictable gene expression engineering. Most artificial RNA-based control 
strategies have yet to take full advantage of these technologies to more predict- 
ably engineer synthetic systems. 

High-throughput RNA structural sequencing [121, 122] presents a new way to 
examine the structures of large numbers of RNA molecules. These techniques 
use RNA cleavage events dependent on the absence or presence of secondary 
structure, followed by high-throughput sequencing, to develop a map of base 
pairing probabilities. This map can then be used to constrain models from struc- 
ture prediction software [123-125]. If paired with half-life quantification, this 
methodology could provide better understanding of the connection between 
UTR structure and transcript stability, enabling more predictable introduction 
of secondary structure into transcripts. 

MFE simulations using tools like Mfold [114, 115] or RNAfold [116, 117] have 
been a mainstay in RNA secondary structure prediction. These tools calculate 
the lowest energy state of an RNA, which is interpreted to be the steady-state 
conformation of the RNA. When attempting to predict RNA secondary struc- 
tures inside cells, MFE calculations may be misleading, for at least two reasons. 
First, mRNA folding is co-transcriptional, and thus the full transcript sequence 
is not available for folding at all times. Second, the relatively short half-life of 
most mRNAs in the cell [2] can preclude their reaching the MFE conformation 
before degradation. To address these issues, there are software packages that 
take co-transcriptional effects into account [118-120] and can thus be useful for 
predicting UTR secondary structures more accurately in a cellular context. In 
fact, the creation of a transcript design method built around kinetic co-tran- 
scriptional RNA folding simulations was crucial for the rRED and aRED engi- 
neering described in Section 10.2.4 [7]. In that work, custom software written to 
implement kinefold [119] on a computational cluster enabled the design of 
spacer sequences to allow assembly of individually generated and characterized 
RNA parts into genetic devices with quantitatively predictable functions. There 
was significant divergence between the transcript folds predicted with MFE 
structure calculations and those obtained with kinetic simulations, underscoring 
the importance of RNA sequence and structure design that explicitly considers 
co-transcriptional folding. To extend those results, we are currently developing a 
computational platform for designing RNA parts, devices, and transcripts with 
kinetic folding algorithms that should be broadly useful for analyzing and
engineering TSC mechanisms [89].
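
The first limitation noted above, that only part of the transcript is available for folding at any moment during transcription, can be illustrated with a toy calculation. The sketch below does not use Mfold, RNAfold, or Kinefold and has no energy model or kinetics; it substitutes the classic Nussinov base-pair maximization algorithm and applies it to successive 5’ prefixes of a sequence. The pairing rules (Watson-Crick plus G-U wobble) and the minimum hairpin loop of three nucleotides are assumptions made for illustration.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov dynamic programming)."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j is unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


def pairs_by_prefix(seq, min_loop=3):
    """Maximum pair count for each 5' prefix, mimicking a growing transcript."""
    return [max_pairs(seq[:length], min_loop) for length in range(1, len(seq) + 1)]
```

For the short hairpin `GGGAAACCC`, `pairs_by_prefix` gives no pairs until the first downstream C emerges, after which one pair is added per nucleotide. A kinetic simulation additionally tracks which of the competing structures actually forms and persists, which this sketch does not attempt.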


10.4.2 Unused Mechanisms Found in Nature


Despite more than a decade of progress, there are naturally occurring mecha- 
nisms for controlling transcript stability that have yet to be exploited for engi- 
neering gene expression. In the following, we enumerate several promising 
mechanisms that could become part of a toolkit to predictably control transcript 
stability. 

The lysC riboswitch, recently described by Caron et al. (Figure 10.5b) [28], 
contains two RNase E cleavage sites that are exposed, while the RBS is simultane- 
ously sequestered, in the presence of the ligand lysine. In a synthetic context, 
rational introduction of these sites into existing riboswitch designs could enhance 
their function by decreasing transcript half-life in the presence of ligand. 

An sRNA titration system has been observed by two different groups in 
Salmonella (Figure 10.5a) [126, 127]. This system functions by relieving sRNA- 
based repression by expressing a decoy mRNA to shunt away sRNA targeting an 
mRNA, leading to rapid degradation of the sRNA. Such a system could allow for 
quick removal of engineered sRNA-based repression that rapidly activates gene 
expression. 

Figure 10.5 Examples of unutilized transcript stability control mechanisms. (a) In naturally
occurring biological systems, decoy RNAs attenuate sRNA-mediated mRNA degradation by
shunting sRNAs away from cognate transcripts, increasing message stability and the level of
genetic expression. (b) The lysC transcript of E. coli contains a riboswitch that controls access
to the RBS and RNase E target sites, showing the functional integration of multiple TSC 
mechanisms. 


RNA-protein interactions occur throughout the lifetime of a transcript and
play critical roles in the degradation process. As results with the bicistronic 5’ 
UTR design [101] highlighted in Section 10.3.3 suggest, some RNA-protein 
interactions can effectively inhibit other RNA-protein interactions. Although 
this cross-inhibition of RNA-protein interactions sometimes confounds system
behavior, this principle could be exploited as a TSC mechanism. For example, 
sequence motifs that recruit protective RNA-binding proteins to the 5’ and 3’ 
UTRs could insulate transcripts from ribonuclease binding. Pentatricopeptide 
repeat (PPR) proteins, a family of single-stranded RNA-binding proteins found 
in plants [128], would be a good candidate for this application. PPR proteins are 
similar to the DNA-binding transcription activator-like effector (TALE) proteins 
[129] in that each protein has an RNA-binding domain, the target specificity of 
which is governed by a series of two-amino-acid repeats, where each repeat cor- 
responds to a target nucleotide. The sequence motif code with which a class of 
PPR proteins binds RNA has recently been uncovered [130]. PPR proteins have 
been implicated in controlling transcript stability in maize chloroplasts by bind- 
ing to the 5’ and 3’ ends of mRNA [131], and a PPR protein was shown to limit 
5’→3’ and 3’→5’ degradation in vitro when its binding site was introduced into
an mRNA [132]. 

Similarly to the PPR proteins, RNA has been found to bind the 3’ UTR of a 
transcript and enhance its stability by offering protection against exoribonucle- 
ase activity. GadY is an asRNA with complementarity to the 3’ UTR of the gadX 
gene in E. coli [56] and is probably one of many such asRNA. 

Though the biochemical details involving polyadenylation are not yet fully 
understood in bacteria, its ubiquity [133] and use in contexts such as in the glmS 
gene [63] for increasing degradation highlight potential utility for engineering 
TSC. As E. coli has only 3’→5’ exoribonucleases, degradation from the 3’ end is
an essential part of rendering a transcript nonfunctional. Adding poly(A) tails of
varying lengths to transcripts, perhaps in an inducible manner similar to a
riboswitch or riboregulator, could function to reliably control and enhance
3’ end degradation by PNPase. 


10.5 Conclusions and Discussion 


Knowledge of RNA degradation in bacteria has progressed substantially since 
the advent of synthetic biology. Key components and processes that account for 
bulk mRNA turnover, translation effects, sRNA action, and polyadenylation 
have become well understood. With this know-how and the continuing efforts of 
the RNA-based engineering community, TSC is positioned to become an even 
more powerful method for programming functions in synthetic biological 
systems. A forward-engineering approach that harnesses understanding of 
biochemical mechanisms to build predictive models for generating desired out- 
puts [7] is now possible, with a number of mechanisms to up- and downregulate 
transcript half-life. Building on existing RNA device engineering efforts, inspira- 
tion from natural mechanisms can point to new ways of regulating stability; and 
as RNA device engineering matures, more complex and wholly synthetic devices 
that, for instance, fuse multiple control mechanisms will become easier to design
and construct. Moreover, increasing knowledge of the underlying biochemical 
and biophysical principles governing RNA degradation will make it easier to 
anticipate how transcript stability may function contrary to genetic device 
output goals. We expect that TSC engineering will create ways to align mRNA 
half-life with design goals and thus will be a vital component of increasing 
system predictability and avoiding confounding effects. The current state of 
research portends the use of TSC to help design synthetic biological systems that 
can dynamically and rapidly respond to their environment with low-burden, 
orthogonal RNA components designed entirely in silico. 

Definitions


RNA synthetic biology Large-scale genetic engineering that makes use of 
RNA-based components (e.g., aptamers, aptazymes, riboswitches) to con- 
struct control devices and systems for programming cellular function 

Transcript stability control Regulation of genetic output by engineering 
mRNA degradation rate 

mRNA degradation The process by which a transcript is hydrolyzed to compo- 
nent monomers by the concerted efforts of the degradosome and associated 
enzymes 

Aptamer Functional RNA structure, typically generated through in vitro selec- 
tion, that binds target ligands 

Ribozyme Functional RNA structure with catalytic activity (e.g., hammerhead 
ribozymes catalyze phosphodiester bond cleavage reactions) 

Aptazyme Composite functional RNA structure consisting of an aptamer and a 
phosphodiester bond-cleaving ribozyme such that the catalytic activity is 
modulated by aptamer ligand binding 

Riboswitch RNA structure that controls gene expression by employing a ligand- 
binding aptamer domain that regulates access of the ribosome to the ribosome 
binding site in cis 

Riboregulator A riboswitch-like RNA unit that regulates access to the ribo- 
some binding site in response to binding by a trans-RNA 

sRNA Small RNAs, usually tens to hundreds of nucleotides long, that bind 
regions of a target mRNA (typically near the 5’ UTR or start codon) to hasten 
mRNA degradation and prevent translation by occluding ribosome docking 


Computational design Here, a methodology that utilizes biochemical and
biophysical models to drive the construction of complex devices and systems
with predictable functions


11.1 Introduction


Small peptides, with as few as six amino acids (aa), can already bear a variety of
functionalities for a wide range of applications. A prime example is peptides that
bind fluorophores for color labeling of proteins. Established over 20 years ago, 
the fusion to fluorescent proteins revolutionized our ability to equip a protein 
with optical traits to follow its behavior in vivo [1]. Although fluorescent pro- 
teins persist as indispensable tools for in vivo imaging, their large size can in 
certain cases interfere with a protein’s function. Small peptide tags, which 
directly bind fluorophores, have therefore shifted into focus and have proven to 
be of decisive importance for visualizing and characterizing biological systems 
and processes in vivo and in vitro [2]. Besides imaging, other important applica- 
tions of small functional peptides include affinity tags for protein purification 
and interaction studies or peptides that serve as substrates for proteolytic 
activities, enabling the control over protein turnover for synthetic biology 
applications [3-5]. 

Besides such applications more common in basic research, small peptides have 
proven important as pharmaceutically relevant agents. Small antimicrobial 
peptides show potential as novel broad-range antibiotics and are already used in 
the food industry, while the immunomodulatory effects of peptides in humans point
toward applications as new immune therapeutic agents [6]. Peptide epitopes or 
carbohydrate-mimicking peptides (CMPs), when inserted into a protective 
protein scaffold, show potential to turn into novel vaccines [7, 8]. In these
cases, the peptide itself is central rather than the modified protein, which is
used as a mere scaffold.

Traditionally, functional peptides are simply stitched onto either the N- or 
C-terminus of a protein. However, it is often essential to insert the peptide into 
the middle of the protein at a permissive site, which accepts additional aa’s. 
Reasons for this include situations in which (i) the termini of a protein might be
functionally relevant or not accessible [9-11]; (ii) internal fusions might be
more resistant to proteolytic degradation than terminal fusions [12]; (iii) the
peptide might need to be structurally stabilized to exhibit its function, as is
especially the case in vaccine development using epitope tags or
small-molecule-mimicking peptides [13-16]; or (iv) the specific function delivered
by the peptide needs per se to be introduced into the middle of the protein, as is
the case for the engineering of cleavable proteins by insertion of a protease
cleavage site [5, 9, 17, 18].

For these reasons, the focus of the subsequent discussion will be on proteins 
where the peptides were integrated into the sequence rather than positioned at 
the N- or C-terminal end. 


11.2. Permissive Sites and Their Identification 
in a Protein 


Sites within proteins at which large insertions are tolerated without loss of 
structural integrity and activity represent an extreme in the spectrum of 
sequence flexibility and have been called permissive sites [19]. Although it was 
originally assumed that permissive sites generally correspond to surface regions 
at which the added sequences do not disrupt overall folding [20], it is in fact 
hard to predict rationally where insertions will be tolerated even in the pres- 
ence of detailed structural knowledge. It is also not clear whether all proteins 
and enzymes are similarly tolerant to insertions. Traditional ways to explore
the structural flexibility of a protein and to identify permissive sites have
therefore been random library approaches. A few studies suggest that permissive
sites can be identified more rationally through comparative sequence analysis
[21, 22].

Two main library approaches have been applied for the identification of per- 
missive sites in various proteins: the first is to generate insertions by limited 
digestion of a plasmid-encoded target gene using different restriction enzymes 
and religating it with a resistance cassette to be able to select for successful inser- 
tions. The resistance cassette needs to be flanked by unique restriction sites to 
subsequently excise the cassette and leave the gene of interest with a defined 
oligonucleotide insertion. The method does not allow the possible insertion
space to be queried completely, but depending on the number of enzymes chosen
for digestion, a sufficient degree of coverage can be reached [23, 24]. The second
approach is insertion mutagenesis mediated by transposons, also known as 
linker insertion mutagenesis. Transposons are mobile genetic elements that
insert quasi-randomly into any DNA sequence through the action of their
corresponding transposase. In this way, any sequence can be delivered, ideally
randomly, into a gene of interest as long as it is placed between the two
transposase-specific recognition sites. In the simplest case, the transposon consists of
a resistance marker, which is flanked by unique restriction sites as well as the 
transposase recognition sites. Subsequent excision of the resistance marker 
results in a characteristic in-frame fingerprint, which is composed of sequences 
from the restriction sites, the transposon ends, and target site nucleotides 
that were duplicated during the primary transposition event [18]. In addition,
virtually any user-defined sequence can be added into the transposon design
such that it remains in the gene after excision of the selection marker. 
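
The in-frame requirement on the sequence left behind after marker excision can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (`translate`, `insert_in_frame`), the toy coding sequence, and the 21-nucleotide insert encoding ENLYFQG are hypothetical and are not taken from the cited studies, and the standard genetic code is assumed.

```python
# Illustrative sketch: hypothetical names and toy sequences, standard genetic code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def insert_in_frame(cds, codon_index, insert_dna):
    """Insert DNA after `codon_index` codons; reject frame shifts and in-frame stops."""
    if len(insert_dna) % 3 != 0:
        raise ValueError("insert length is not a multiple of 3: frame shift")
    codons = (insert_dna[p:p + 3] for p in range(0, len(insert_dna), 3))
    if any(CODON_TABLE[codon] == "*" for codon in codons):
        raise ValueError("insert contains an in-frame stop codon")
    cut = 3 * codon_index
    return cds[:cut] + insert_dna + cds[cut:]


toy_cds = "ATGAAAGGGTTTTAA"             # encodes M K G F (stop)
tev_site_dna = "GAAAACCTGTATTTTCAGGGC"  # encodes E N L Y F Q G
chimera = insert_in_frame(toy_cds, 2, tev_site_dna)
print(translate(chimera))               # MKENLYFQGGF
```

An insert whose length is not a multiple of three is rejected, which mirrors why the residual fingerprint has to be designed to stay in frame.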

By these random approaches, permissive sites that accept short three-residue
linker insertions have been explored for the enzymes β-lactamase and
β-galactosidase, revealing a variety of phenotypes depending on the nature of the
inserted residues [23, 25]: insertions with a physicochemical character similar
to that of the neighboring aa’s (regarding hydrophobicity, acidity, and charge)
had less effect on enzyme functionality than physicochemically distant residues.
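
One way to make the notion of physicochemical distance concrete is to compare the mean hydropathy of an insert with that of its flanking residues. The Python sketch below is a rough heuristic, not the analysis used in the cited studies: it uses the published Kyte-Doolittle hydropathy scale, considers hydrophobicity only (not acidity or charge), and the function names, the flank width of three residues, and the toy protein are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values; their use here is an illustrative assumption.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(peptide):
    """Average Kyte-Doolittle hydropathy of a peptide."""
    return sum(KYTE_DOOLITTLE[residue] for residue in peptide) / len(peptide)


def hydropathy_distance(protein, position, insert, flank=3):
    """Absolute difference in mean hydropathy between an insert and its neighbors.

    `position` is the index in `protein` before which the insert would be placed.
    """
    neighbors = protein[max(0, position - flank):position + flank]
    return abs(mean_hydropathy(insert) - mean_hydropathy(neighbors))


toy_protein = "MKTLVIAIVGSD"  # hypothetical sequence; neighbors at position 6 are LVIAIV
similar = hydropathy_distance(toy_protein, 6, "ALV")  # hydrophobic insert
distant = hydropathy_distance(toy_protein, 6, "DKE")  # charged insert
print(similar < distant)
```

In this toy case the hydrophobic insert is much closer to its hydrophobic neighborhood than the charged one; whether such a score predicts tolerance in a real protein would have to be tested experimentally.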

In another study, linker insertion mutagenesis of TEM1 β-lactamase revealed
that two-residue insertions into predicted β-sheets abolished enzymatic activity,
while insertions into predicted reverse turns only affected the level of activity
but did not cause complete loss of function [26]. Further, in some cases,
insertions of four residues abolished enzymatic activity, while insertion of two
residues into the same site did not cause complete loss of lactamase activity.

Permissive sites that accept longer insertions, such as the seven-residue
tobacco etch virus (TEV) protease cleavage site, and that are in addition accessible
for efficient cleavage by the corresponding protease were explored for a variety
of integral membrane proteins (like the pullulanase secretin PulD from Klebsiella
oxytoca [27], the protein transporter FhaC of Bordetella pertussis [28], and
Escherichia coli lac permease [20]) in order to study structure-function
relationships. Further, random insertions of a 31-residue, mostly hydrophilic
peptide (so-called i31 libraries) were studied for the membrane-inserted maltose
transporters MalG and MalF [29]. In both cases, that is, for TEV cleavage site
insertions as well as for i31 insertions, permissive sites allowing functional
insertions were found to be mostly located in periplasmic turns or surface loops
but not in parts spanning the membrane or in regions necessary for
multimerization with interaction partners. Sequence insertions into nonpermissive
sites affected folding, membrane insertion, multimerization, and overall
functionality.

Only in a few cases have permissive sites been successfully explored for
cytosolic proteins. The same i31 libraries as mentioned previously were used for
the mapping of functional domains and further for the identification of
permissive sites in MalK, the cytosolic adenosine triphosphate (ATP)-binding
component of the maltose ATP-binding cassette (ABC) transporter of E. coli; in
LacI, the regulator of the lac operon; and in the F-plasmid-derived relaxase
TraI [29-32].

Further, a random transposon-based approach was used to successfully iden- 
tify permissive sites in the essential E. coli chaperonin GroEL by delivering a TEV 
cleavage site through transposon mutagenesis [9] as well as in the essential 
Saccharomyces cerevisiae glycosylphosphatidylinositol (GPI)-anchored membrane
protein Dcw1 [10].

Identification of permissive sites within the mentioned proteins, all of which
are spatially rather complex assemblies, indicates that other less challenging
proteins might be able to accept even larger insertions at certain positions.
However, as was already shown for short insertions [23], the permissiveness of a
given site depends on the size and the character of the inserted sequence, and
the functionality of a given insertion needs to be evaluated case by case.

Still, the current literature suggests the widespread existence of permissive 
sites for peptides of a length between a few and a few dozen residues.


11.3. Functional Peptides


11.3.1 Functional Peptides that Act as Binders 


Peptides can specifically bind to small molecules, metals, or proteins. The ability
of peptides to bind small molecules is often exploited for optical purposes. One
of the most widely applied peptide tags, and the first described alternative to
color labeling by fusion with fluorescent proteins, is the small six-residue
tetracysteine tag (TC-tag) with the sequence CCPGCC. The TC-tag
was rationally designed to covalently bind the green fluorescent arsenical dye
FlAsH (fluorescein arsenical helix binder), whose fluorescence increases
1000-fold upon binding to the polypeptide tag. By now, a number of different
bisarsenical fluorophores and corresponding tags have been developed [33-35].
Redesign of the FlAsH binding motif CCPGCC to bind the cyanine dye AsCy3
furthermore allows for simultaneous multiple-color labeling [35]. The AsCy3
binding motif has the sequence CCKAEAACC, and discrimination between the two dyes
is based on the larger interatomic distance between the two arsenics in AsCy3
(14.5 Å) than in FlAsH (6 Å). Due to their small size, the TC-tag and its
derivatives have already proven useful as tools for in vivo imaging in bacterial
[36] as well as eukaryotic cells [37], enabling experiments not possible with
large fluorescent protein reporters. However, the method suffers from high
background labeling by binding of the arsenic dyes to thiol-rich biomolecules,
and extensive washing steps need to be applied to achieve highly specific
labeling [38].

Besides the TC-tag, other fluorophore-binding peptides have been developed:
generally known as an affinity tag for protein purification, the 6x histidine tag
(6xHis-tag) was shown to bind metal-nitrilotriacetate (NTA)-chromophore
conjugates [39] and a zinc-chelating membrane-impermeable fluorophore
called HisZiFit [40]. This enabled the site-specific labeling and tracking of
the stromal interaction molecule STIM1, a membrane protein for which an
N-terminal fluorescent protein fusion had been shown to interfere with surface
exposure [40], exemplifying again the advantage of peptide tags over protein
fusions. However, the binding affinity of the mentioned dyes to the 6xHis-tag
was only moderate and restricted the application to extracellular labeling of cell
surface proteins.

Next to these rational approaches for the design of labeling tags, directed evo- 
lution was shown to yield peptides with binding properties. Phage display was 
used to evolve a peptide tag that binds the dye Texas Red (called “Texas Red 
aptamer”) and its calcium-sensing derivative X-rhod with high affinity. This way, 
the authors developed a 28-residue-long calcium sensor that can be “hijacked” to 
various cell compartments depending on the cellular localization of the protein 
to which the Texas Red aptamer is fused [41]. 

The same approach was used to evolve a lanthanide binding peptide (LBT),
specifically a terbium(III)-binding peptide, 15 residues in length for
luminescence studies [42, 43]. LBTs that bind different lanthanide ions had
already been employed before for NMR studies [44] or X-ray crystallography [45].
Interestingly, it was shown that the insertion of LBTs into internal loops of a
protein helped in rigidifying the peptide. This made internal fusions superior to
terminal fusions for phase determination in X-ray crystallography and potentially
also for the structure determination of large protein complexes by NMR [13].

Peptides can exhibit high affinity for metals, other peptides, tags, or proteins. 
Such affinity tags are already widely applied for protein purification, immobiliza- 
tion, and pull-down studies [46]. Monoclonal antibodies for most affinity tags 
are commercially available, making (parallel) detection of tagged proteins possi- 
ble, thus circumventing the need for protein-specific antibodies. Internal tagging 
expands these applications to proteins with functionally relevant or inaccessible 
termini. An affinity tag that is considered to be particularly suited for
insertion into internal permissive sites is the small, uncharged Strep-tag. The
nine-residue peptide sequence exhibits intrinsic affinity toward Strep-Tactin, a
specifically engineered streptavidin [47]. Due to the highly specific but
noncovalent binding, proteins can be purified under physiological conditions in
one step from crude cell lysates, without the need for high salt concentrations
or other additives [48].


11.3.2 Peptide Motifs that are Recognized by Labeling Enzymes 


Peptides can serve as specific substrates for enzymes. Highly specific binding of
small molecules to peptides is often hard to achieve. For this reason,
enzyme-mediated labeling has received attention as an alternative methodology to
tag cellular proteins with chemical probes. Here, enzymes selectively act on a
specific peptide sequence to covalently add their cognate substrate. One such
enzyme is lipoic acid ligase (LplA) from E. coli. Naturally responsible for
attaching lipoic acid to proteins involved in oxidative metabolism [49], LplA was
rationally redesigned to specifically attach useful small-molecule probes, such
as alkyl azides [50] and photo-cross-linkers [51], onto an engineered 22-aa-long
LplA acceptor peptide (LAP1). The authors used this technology to label cell
surface proteins and to map protein-protein interactions in vitro. Using yeast
display for affinity selection, the originally used 22-residue acceptor peptide
LAP1 could be shortened to only 13 residues (LAP2) while at the same time showing
a 70-fold higher catalytic efficiency (kcat/Km) as a substrate for LplA [52]. By
structure-guided mutagenesis, LplA was then further evolved to accept a
fluorescent coumarin derivative instead of a lipoic acid derivative as substrate
[53]. The resulting variant LplA W37V, together with the optimized acceptor
peptide LAP2, made the method suitable for in vivo labeling of proteins in
eukaryotic cells. In contrast to the previously used fluorescent lipoic acid
derivatives, the employed coumarin derivatives were orthogonal to eukaryotic
metabolism. The original LAP1 peptide, which had been employed in cell surface
and in vitro assays, was then revisited to develop an in vivo protein-protein
interaction assay: the relatively low affinity but good catalytic activity of
LplA for LAP1 made it possible to render fluorophore attachment dependent on
protein-protein interaction [54]. Attachment of LAP1 and LplA to potential
dimerizing partners allowed sufficient discrimination between an interacting and
a noninteracting protein pair according to the labeling efficiency [54].
Altogether, two powerful alternative methods for the highly specific but
unobtrusive labeling of proteins for imaging and interaction studies in vivo were
introduced: PRIME (PRobe Incorporation Mediated by Enzymes) and ID-PRIME
(Interaction-Dependent PRobe Incorporation Mediated by Enzymes). Meanwhile,
several improvements regarding kinetics of labeling and diversification of the
PRIME substrates were reported [55-58] as well as examples of their application in
solving biological questions [59].

A similar labeling strategy is based on the biotin ligase BirA from E. coli [60] or 
its analogs from yeast (yBL) and Pyrococcus horikoshii (PHBL) [61]. Biotin ligases 
covalently attach biotin to a lysine in a 15-residue biotin acceptor peptide (BAP) 
[62]. In this way, biotin analogs that are accepted by BirA can be attached to
proteins labeled with BAP [63]. Although BirA action is orthogonal to eukaryotic
biotinylation, the method is essentially restricted to protein labeling on the cell 
surface as endogenous biotin still serves as a much better substrate than the cor- 
responding derivatives. The BirA/BAP pair has been used for proximity studies 
on cell surfaces and to image communication across cells by transsynaptic bioti- 
nylation [64]. 

Clearly, each peptide-based labeling method has its advantages regarding
host range, complexity of the labeling approach, and optical properties of the
employed fluorescent or luminescent labels. Thus, the choice of the appropriate
labeling technique needs to be carefully considered when designing an in vivo
labeling experiment. Some excellent recent reviews elaborate in more detail on
these topics [65-67]. So far, labeling of proteins at internal sites has not been
extensively explored, although it would add more flexibility in experimental
design than focusing only on terminal tagging.


11.3.3 Peptides as Protease Cleavage Sites 


Peptides can be used to influence the degradation kinetics of proteins for appli- 
cations in basic science, synthetic biology, and biotechnology [4, 5, 68, 69]. 
Tuning a protein’s stability in vivo can be achieved through N- or C-terminal 
degradation tags [70-72]. Alternatively, the process of protein inactivation can
be rendered conditional, for example, by inserting into an internal permissive
site a cleavage site that is recognized by a specific (ideally host-orthogonal)
protease. Nuclear inclusion protein a (NIa) proteases obtained from viruses of
the family Potyviridae are the most promising target proteases due to their high
activities and sequence specificity. Potyvirus proteases are responsible for
processing the potyvirus polyprotein into its functional units [73]. The
best-characterized and most commonly applied member is TEV protease, which
recognizes the consensus sequence ENLYFQG, with cleavage occurring between
Q (P1 position) and G (P1′ position). TEV protease is relaxed toward
substitutions in the P5, P4, P2, and P1′ positions [74, 75]. This gives some
freedom for cleavage site design, although cleavage efficiencies vary depending
on the exact residue inserted.
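
The consensus and its relaxed positions can be turned into a simple sequence scan. The following Python sketch is illustrative only: the function name `tev_cut_positions` and the toy protein are hypothetical, and treating the relaxed positions (P5, P4, P2, and P1′) as "any residue" is a deliberate simplification, since cleavage efficiency depends on the exact residue at each position.

```python
import re

# Consensus ENLYFQG spans P6..P1'; cleavage occurs between Q (P1) and G (P1').
# Lookaheads are used so that overlapping candidate sites are all reported.
STRICT_TEV = re.compile(r"(?=ENLYFQG)")
# Simplifying assumption: relaxed positions P5, P4, P2, and P1' accept any residue.
RELAXED_TEV = re.compile(r"(?=E..Y.Q.)")


def tev_cut_positions(protein, relaxed=False):
    """Return indices i such that cleavage would fall between protein[i-1] and protein[i]."""
    pattern = RELAXED_TEV if relaxed else STRICT_TEV
    return [match.start() + 6 for match in pattern.finditer(protein)]


toy_protein = "MKTENLYFQGSAAEAAYAQSLL"  # hypothetical sequence
print(tev_cut_positions(toy_protein))                # [9]
print(tev_cut_positions(toy_protein, relaxed=True))  # [9, 19]
```

A scan of this kind could be used to check that a scaffold carries no unintended candidate sites before a cleavage site is inserted, with the caveat that a pattern match does not guarantee accessibility or efficient cleavage.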

The TEV consensus sequence is not found in the proteome of mammalian cells 
[76], yeast [4], or E. coli [77]. Besides its application in protein purification where 
it is frequently used to cleave off affinity tags [78], proteolysis by TEV protease 
was used for in vivo studies as a tool to bleach essential proteins [4, 77], to 
study phosphorylation-dependent protein-protein interactions [79], to trigger 
apoptosis in mammalian cells [76], or to establish gene networks for synthetic
biology [5, 17]. Further well-characterized potyvirus proteases are the plum pox
virus (PPV) protease, which recognizes the 7 aa sequence NVVVHQ/A [80], and
the tobacco vein mottling virus (TVMV) protease, which is specific for ETVRFQG/S
[81]. While PPV and TVMV proteases have been used for processing of fusion
proteins in vivo and in vitro [82, 83], they have not yet been extensively explored 
as tools for systemic functionality studies of target proteins in vivo. However, 
potyvirus proteases are orthogonal to each other, meaning they cannot recognize 
the cleavage sites of each other efficiently [74, 84], which might facilitate com- 
bined employment in vivo, for example, for synthetic posttranslational modifica- 
tion networks. 
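
The orthogonality of the three recognition sequences can be sketched as an in silico digest. The Python example below is a minimal illustration under stated assumptions: the patterns use only the exact sequences quoted in the text, the slash is read as the scissile bond for PPV (NVVVHQ/A) and as a choice of P1′ residue for TVMV (ETVRFQ followed by G or S), and the function name `digest` and the toy substrate are hypothetical.

```python
import re

# Each pattern matches up to P1; the lookahead checks P1' without consuming it,
# so match.end() is the position of the scissile bond.
CLEAVAGE_PATTERNS = {
    "TEV": re.compile(r"ENLYFQ(?=G)"),
    "PPV": re.compile(r"NVVVHQ(?=A)"),       # assumed reading of NVVVHQ/A
    "TVMV": re.compile(r"ETVRFQ(?=[GS])"),   # assumed reading of ETVRFQG/S
}


def digest(protein, protease):
    """Split a protein at every exact recognition site of the named protease."""
    cuts = [match.end() for match in CLEAVAGE_PATTERNS[protease].finditer(protein)]
    fragments, start = [], 0
    for cut in cuts:
        fragments.append(protein[start:cut])
        start = cut
    fragments.append(protein[start:])
    return fragments


toy_substrate = "MAAENLYFQGKKKNVVVHQATTT"  # hypothetical; carries one TEV and one PPV site
for name in CLEAVAGE_PATTERNS:
    print(name, digest(toy_substrate, name))
```

In this toy substrate each protease cuts only at its own site and TVMV protease leaves the sequence intact, which is the behavior that combined in vivo use would rely on.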


11.3.4 Reactive Peptides 


All the aforementioned peptides offer “passive” functions like “binding” or “being
recognized” and are acted upon by separately encoded enzymes. Peptides that
encode an “active” function, for example, catalytic activity, constitute an
interesting expansion of the functional portfolio of peptides, especially when
these activities can be added in an orthogonal manner to the activity of an
already functional scaffold protein.

One outstanding example of a reactive peptide is the 13 aa SpyTag peptide, which
rapidly forms an isopeptide bond between a peptide-internal aspartate and a
lysine residue in its target protein SpyCatcher (138 aa, 15 kDa) [85, 86]. In
the demonstrated setting, SpyTag was reactive irrespective of its location in the
scaffold protein (terminal or internal). Recently, a second isopeptide
bond-forming peptide tag/target protein pair (SnoopTag/SnoopCatcher) was
introduced that is completely orthogonal to the SpyTag/SpyCatcher system [87].
Both pairs have been used together to build synthetic polyproteins [87], to
design optimized vaccines [88], and to assemble bioactive protein hydrogels [89].
The hydrogel assembly was achieved by combining internal and terminal SpyTag
insertions.


11.3.5 Pharmaceutically Relevant Peptides: Peptide Epitopes, 
Sugar Epitope Mimics, and Antimicrobial Peptides 


The structural flexibility of proteins to accept additional residues makes them 
suitable scaffolds to structurally stabilize or protect peptides from degradation. 
While the functional peptides that were discussed before can be seen as tools to
facilitate the study of properties or the (modulation of the) in vivo behavior of a
certain protein of interest, the functional peptides of the following section are
themselves the actual targets of interest, and the protein into which they are
inserted is a tool to facilitate their production or application. This section is
only meant to give an impression of the diversity of pharmaceutically relevant
functions that can be adopted by peptides and to discuss the potential impact that
an extended knowledge of permissive sites could have on the field of therapeutic
peptides. For a broader treatment, the reader is referred to an excellent
recent review [90].


11.3.5.1 Peptide Epitopes

The immune response elicited by a given pathogen is specific against certain 
exposed fractions of the pathogen’s proteome, called epitopes. For the design of 
novel vaccines, approaches are explored where known epitopes are taken out of 
their natural (pathogenic) context and inserted into a different protein scaffold 
that is, in contrast to the protein of the epitope’s origin, nontoxic and easy to 
purify in high yield. It was already shown a decade ago that a permissive site 
within B. pertussis adenylate cyclase toxin-hemolysin (ACT-Hly) can be used to 
deliver a CD8+ T-cell epitope into antigen-presenting cells in vivo and induce
protective antiviral as well as therapeutic antitumor cytotoxic T-cell responses 
[91-93]. Adenylate cyclase toxoids can penetrate a variety of immune effector 
cells. Variants with disrupted catalytic activity, which are still cell invasive, are 
therefore considered a potent scaffold for vaccine design. 

Further, the nontoxic B subunit of cholera toxin [12, 94] as well as the hepatitis 
B core particle-forming protein HBcAg (for hepatitis B core antigen) have been 
explored as potential vaccine scaffolds by insertion of relevant epitopes, for 
example, a hepatitis C-specific epitope or the HIV-1-neutralizing epitope 
[95, 96]. More recently, adenovirus (Ad) capsid proteins have shown enormous
promise for the realization of diverse vaccines [97-99]. Computational
strategies have also been developed to guide the design of epitope-equipped
protein scaffolds for conformational stabilization and immune presentation
[14-16].

For the design of novel chimeric vaccines, two points can be extracted from
the body of available literature that seem to be most relevant for consideration:
(i) the protein that is supposed to serve as a scaffold should be highly
immunogenic by itself to elicit, next to the scaffold-specific response, high
antibody production against the inserted epitope and (ii) a solvent-exposed
permissive site within the scaffold should be known. The second point seems
relevant for two reasons. Firstly, though it was shown that terminal epitope
fusions are in principle able to elicit epitope-specific immune responses, these
proteins were prone to degradation, which might interfere with the generation of
the response. In contrast, proteins with internal insertions were stably expressed
[12]. Secondly, and more importantly, it was shown for HBcAg-epitope chimeras that
insertions in an internal permissive site led to higher epitope-specific antibody
production than terminal fusions, especially when the chimeric protein was
designed such that the inserted epitope replaced another immunogenic region of the
scaffold [100].

Although some proteins with known permissive sites are available and
employed for the design of vaccine chimeras, the field seems limited by a lack of
novel scaffolds. Better knowledge of permissive sites and their identification in
potentially attractive scaffolds could thus pave the way for novel vaccine
strategies.


11.3.5.2 Peptide Mimotopes 

Peptide mimotopes are short aa sequences that mimic small molecules or
carbohydrates. By using a protein that naturally binds the target molecule, for
example, monosaccharide-binding lectins, mimotopes can be selected from peptide
libraries by phage display. By definition, a selected peptide is considered
mimetic if it also interacts with several other proteins or receptors that are
known to bind the natural ligand [101]. Mimicry is thus defined as “binding to
the same proteins as the natural ligand” rather than resembling the
physicochemical properties or the molecular recognition characteristics of the
ligand. As mimicry is rarely obvious upon comparing the chemical structures of
ligand and mimotope, rational design of a peptide mimic of a small molecule is
currently nearly impossible.

CMPs are of particular pharmaceutical relevance. Carbohydrates are often
displayed on the outer surfaces of pathogens and tumor cells and are therefore
potential immunological targets for diagnosis, antibody production, and vaccine
development. However, carbohydrates are intrinsically T-cell-independent
antigens, which diminishes their efficacy as immunogens [102]. Further,
carbohydrates are difficult to chemically synthesize in high yield, particularly
due to the absolute requirement for the correct stereoconfiguration [103]. In
contrast, preparative production routes to peptides have emerged over the last
years [104], and peptides have an absolute requirement for T cells, making them
better immunogens. The conversion of carbohydrate epitopes to peptide mimotopes
therefore has the potential to overcome the shortcomings of carbohydrate
immunogens [6].

The first attempt to establish a CMP using phage display was made with the
jack bean lectin concanavalin A (ConA), which binds α-mannose. This effort led
to the identification of the tripeptide YPY, to which ConA binds with high affinity
[105]. These studies were followed by the screening and successful identification 
of a variety of peptide mimics against various pathogen- and virus-associated 
mono- and polysaccharides of high complexity. For a detailed and comprehen- 
sive overview on available mimotopes and their applications, see [101]. 

The available studies support the remarkably high potential of peptides to
mimic virtually any desired chemical monomer or polymer: the right peptide
sequence simply needs to be discovered by an appropriate method such as phage
display. Although CMPs are traditionally chemically synthesized and, to the
best knowledge of this author, have not been inserted into a permissive site of a
protein scaffold, they illustrate the great versatility of potential chemical
characteristics and functions that peptides can adopt and that can even be
expanded by directed evolution. Consequently, as exemplified for peptide
epitopes, the insertion of mimotopes into protein scaffolds for structural
stabilization, or simply the use of these scaffolds as carriers for in vivo
delivery or high-yield production, might bear great but unexplored potential.


11.3.5.3 Antimicrobial Peptides 

Antimicrobial peptides are short, mostly cationic hydrophobic peptides with
antimicrobial activity against a broad variety of microbes [106]. Although even
di- and tripeptides with antimicrobial activity have been reported [107], their
size usually varies from 7 [108] to about 60 aa residues [109]. Antimicrobial
peptides adopt secondary structures including α-helices, relaxed coils,
antiparallel β-sheets, and gamma-core motifs (two antiparallel β-sheets connected
by a short turn, as found in defensin-like peptides and often including disulfide
bridges). There is a relationship between structure and function, with
amphipathic α-helical peptides often being more active than structurally less
defined peptides and peptides with the gamma-core motif often being very active
[110].

Due to their overall positive charge, antimicrobial peptides can accumulate at
the negatively charged microbial cell surface, which often contains acidic
polymers in Gram-negative as well as Gram-positive bacteria. After self-mediated
uptake, they insert into the cytoplasmic membrane, disrupting its physical
integrity [111]. Some peptides can also cross the membrane and act on
intracellular targets [107]. Besides their high potential as broad-range
antibiotics, recent studies point toward a second function: the cationic
peptides, which are not only produced by bacteria in their fight to populate
ecological niches but are also found in higher organisms as a defense mechanism
[112], are modulators of innate immunity [113, 114]. This property is thought to
have potential for the development of novel anti-infective therapeutic strategies
[115]. A comprehensive database containing the sequences and properties of animal
and plant peptides is available [116].
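
The two properties emphasized above, net positive charge and hydrophobicity, can be estimated from sequence alone. The Python sketch below is a crude illustration rather than a validated predictor: the charge rule (+1 per K or R, -1 per D or E, ignoring histidine and the termini), the set of residues counted as hydrophobic, the function names, and the toy sequence are all assumptions made for this example.

```python
def net_charge(peptide):
    """Crude net charge near neutral pH: +1 per K/R, -1 per D/E (His and termini ignored)."""
    positive = sum(peptide.count(residue) for residue in "KR")
    negative = sum(peptide.count(residue) for residue in "DE")
    return positive - negative


def hydrophobic_fraction(peptide, hydrophobic="AILMFVWY"):
    """Fraction of residues belonging to an (assumed) set of hydrophobic amino acids."""
    return sum(peptide.count(residue) for residue in hydrophobic) / len(peptide)


toy_peptide = "KLAKLAKKLAKLAK"  # illustrative cationic sequence, not taken from the source
print(net_charge(toy_peptide))  # 6
print(round(hydrophobic_fraction(toy_peptide), 2))
```

Simple descriptors like these could serve as a first filter when browsing sequence collections, although membrane activity also depends on structure, as noted for amphipathic helices and gamma-core motifs.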

To meet the needs of basic research and clinical trials, large quantities of 
highly purified peptides are required. Although some peptide antibiotics are 
synthesized non-ribosomally by complex peptide synthetases, most of the pep- 
tides are genetically encoded. Recombinant production in bacteria offers an 
attractive approach for cost-effective large-scale peptide manufacture. A
database housing information on recombinant approaches to generate suitable
amounts of antimicrobial peptides for biological and structural studies has been
established [117]. The field of antimicrobial peptide production therefore nicely
exemplifies the attempt to overcome shortcomings in the chemical production
of peptides, an effort that could also be of great interest for the production
of mimotopes.

Most production approaches resemble the natural production mechanism of 
antimicrobial peptides in their host [118]: to protect the production host from 
peptide toxicity and the peptide from cellular degradation, the target peptides 
are produced as larger precursors that are then processed by proteases to release 
the actual active peptide moiety. In this way, a variety of fusion partners and
release strategies have been explored [119]. Besides host toxicity and
degradation, the intrinsic hydrophobicity of the peptides also impairs their
soluble production when overproduced in a bacterial host. Commonly used strategies
involve solubility-enhancing fusion partners like thioredoxin and
glutathione-S-transferase (GST)
[120], but also small aggregation-promoting carriers, for example, PurF [121] or 
ketosteroid isomerase [122], were explored. The rationale for the latter is to 
channel the peptide fusion into inclusion bodies to circumvent host toxicity and 
degradation while still having easy access to the peptide. Release from a carrier 
can be achieved by chemical hydrolysis or by specific proteases like TEV 
protease [119]. 

However, current yields for purified peptides are limited to milligrams per liter
of culture, and current efforts focus on finding novel scaffolds for efficient
expression. Again, to the best knowledge of this author, peptide production using
permissive sites within solubilizing scaffolds has not been explored yet but seems
to be a promising alternative to current approaches, especially to address the
problem of peptide degradation during production.


11.4 Conclusions

Literature suggests an astonishing versatility in peptide functionalities (Table 11.1).
Now-standard directed evolution and selection techniques have the potential to
broaden this spectrum. Indeed, successful peptide engineering by directed
evolution has already been achieved, as exemplified by the identification of novel
chromophore-binding peptides or peptide mimotopes from phage libraries.

Still, the exploitation of the full potential of functional peptides for the
engineering of synthetic chimeras seems to be limited by the available knowledge
of permissive sites and the need for relatively labor-intensive methods to
identify them in a scaffold of interest. Therefore, more comprehensive rational
methods would be desirable that, together with recent advances in DNA
modification methods at the chromosome level [123-125], might be a step toward
the exploitation of the full potential of superfunctionalized proteins.


Table 11.1 Available functional peptide tags.

Tag | Function substrate/enzyme | Length (aa’s) | Application | References

Peptide binds small molecule
Tetracysteine tag (TC-tag) | FlAsH, ReAsH | 6 | Intracellular fluorescent labeling of proteins in vivo and in vitro, eukaryotic cells, and bacteria | [33]
6xHis-tag | Ni-NTA derivatives, HisZiFit | 6 | Extracellular labeling of proteins | [39, 40]
Texas Red aptamer | Texas Red and derivatives | 38 | Intracellular calcium sensing | [41]
Lanthanide binding tag | Lanthanides | 15-20 | Extracellular labeling and in vitro structural studies by NMR or X-ray crystallography | [43]

Peptide acts as recognition site for labeling proteins
Biotin acceptor peptide (BAP) | Biotin derivatives/biotin ligase | 22 | Extra- and intercellular labeling of proteins in vivo and eukaryotic cells | [63]
Lipoic acid acceptor peptide (LAP1 and LAP2) | Lipoic acid derivatives, coumarin derivatives/lipoic acid ligase | 13-22 | Extra- and intracellular labeling of proteins in vivo and eukaryotic cells | [53, 54]

Peptide is recognized by hydrolyzing enzyme
TEV protease recognition peptide | TEV protease | 7 | Cleavage of fusion proteins; has been explored as tool for mediated posttranslational modification in systems biology | [75]
PPV protease recognition peptide | PPV protease | 7 | Cleavage of fusion proteins; has potential as TEV-orthogonal tool for in vivo studies | [82]
TVMV protease recognition peptide | TVMV protease | 7 | Cleavage of fusion proteins; has potential as TEV-orthogonal tool for in vivo studies | [83]

Peptide is an affinity tag
Strep-tag | Strep-Tactin, avidin | 8 | One-step purification, immobilization, detection | [47]

Peptide is reactive
SpyTag | SpyCatcher | 13 | Isopeptide bond formation with protein partner SpyCatcher | [85]
SnoopTag | SnoopCatcher | 12 | Isopeptide bond formation with protein partner SnoopCatcher. The reaction is orthogonal to SpyTag/SpyCatcher | [87]

Definitions 


Permissive site Sites within a protein where insertions of several amino acids 
are accepted without compromising folding or function 

Functional peptide Small amino acid sequence (defined here as approximately 
6-30 residues), which shows a stand-alone biological function 

Superfunctionalization Incorporation of an (orthogonal) functionality into the 
primary function of a protein by insertion of a functional peptide sequence 
into a permissive site of the target protein 

Protein scaffold A protein whose primary function is to serve structurally as a
docking point for additional functions

Protein engineering The design of new enzymes or proteins with new or desir- 
able functions 


Abbreviations 

aa amino acid 

FlAsH fluorescein arsenical helix binder
TC-tag tetracysteine tag

LplA lipoic acid ligase 

LAP LplA acceptor peptide 

BirA biotin ligase

BAP biotin acceptor peptide

LBT lanthanide binding tag

TEV protease tobacco etch virus protease 

PPV protease plum pox virus protease 

TVMV tobacco vein mottling virus 
CMP carbohydrate-mimicking peptide 
GST glutathione-S-transferase


12.1 Introduction


Increasing numbers of chemicals are produced by various genetically engineered 
organisms. Those organisms possess biosynthetic pathways composed of 
enzymes that act successively on the emerging substrate, in order to produce the 
final product molecule. The efficiency of biosynthetic pathways is crucial for 
industrial processes, and various strategies for the optimization of production 
strains have been undertaken thus far. The most common strategies include 
(i) increasing the pool of available substrates and/or overexpression of the 
enzymes of the limiting biosynthetic steps [1-3], (ii) introducing heterologous 
enzymes with preferred kinetic characteristics [4], and (iii) inhibition of the non- 
desired branching of biosynthetic pathways [5, 6]. 

Although diverse, none of the aforementioned approaches guarantees the optimal
arrangement of the enzymes of a biosynthetic pathway inside the producing
strain. Even if overexpressed, the enzymes still float randomly in the cytoplasm,
which results in nonoptimal metabolite flow. In living cells, biosynthetic pathway
enzymes or other functional polypeptides are often brought together into 
multienzyme complexes through specific interactions, membrane anchoring, or 
organelle targeting mechanisms. This type of organization increases the local 
concentration of enzymes and their substrates and products and minimizes the 
concentration of intermediates that may be toxic or unstable or may represent 
substrates for branching reactions. We can view such multienzyme complexes as 
autonomous units, where the evolving substrate travels from one enzyme to 
another without dissociating into the bulk solution. Therefore, reaction interme- 
diates cannot be used by other competing biosynthetic pathways that synthesize 
non-desired side products. Due to the smaller characteristic distances between 
the consecutive enzymes in the pathway, reactions can run more efficiently. 

DNA scaffolding is an artificial approach to the designed formation of
multienzyme complexes and will be described here. Similarly, RNA molecules have
been used as scaffolds for biosynthetic pathway enzymes [7]. Protein scaffolding
and organelle targeting more closely imitate the formation of natural enzyme
complexes and were used in the first attempts to improve biosynthetic pathway
efficiency by designed substrate channeling [8-12].

Although the number and distribution of the enzymes in a multi-protein
complex [8-12] (Chapter 13) could be programmed by the sequence of a polypeptide
backbone, the three-dimensional arrangement of the polypeptides may be 
unpredictable due to the flexibility in the peptide linkers between the interaction 
domains (Figure 12.1a,b, Table 12.1). Designing the polypeptide backbone with 
the scaffold-guided protein domains is limited by the number of available protein 
dimerization domains. Finally, each protein interaction domain has different 
conditions under which it folds and forms the functional interactions. 


Figure 12.1 Spatial and temporal organization of biosynthetic enzymes based on different
types of scaffolds. (a) Biosynthetic enzymes are typically randomly distributed inside the cell. 
The conversion of the substrate may therefore be limited by the diffusion rate and the 
concentration of the substrate and localization of the enzymes. (b) Immobilizing biosynthetic 
enzymes using synthetic protein scaffolds can bring the enzymes into close proximity and 
therefore enhance the metabolic flux. In the absence of a large superscaffold, the precise 
arrangement of enzymes is unpredictable and is limited by the tertiary structure of the 
protein scaffold. (c) Biosynthetic enzymes with predictable RNA binding sites have been 
assembled using synthetic RNA aptamers. The enzymes are in close proximity and in the 
predefined order, which enables faster conversion of the substrate into the end product.
(d) An assembly line based on the DNA scaffold promotes positioning of biosynthetic
enzymes in close proximity and the predefined order. The substrate conversion is faster with 
less unwanted side products. The enzymes are linked to DNA-binding domains, which 
recognize specific nucleotide sequences. The order of the enzymes can easily be changed by 
changing the order of the specific nucleotide sequence on the DNA program, which can lead 
to different end products. 


Table 12.1 Advantages and disadvantages between DNA, protein, and RNA scaffolds.

Scaffold                DNA                           Protein                     RNA

Spatial orientation     Linear                        Bundled                     Linear

Order                   Highly predictable            Unpredictable               Predictable, however less
                                                                                  than for the DNA scaffold

Localization in         Nuclei                        No limitation               Cytosol
eukaryotes

Scaffold-enzyme ratio   Difficult to achieve          Easy to achieve favorable   Easy to achieve favorable
                        substantial amount of         ratio with gene expression  ratio with gene expression
                        scaffold, ratio in favor of   regulation                  regulation
                        enzymes

Scaffold-enzyme         Similar, well-characterized,  Variations in strength,     Limited number of
interactions            predictable interactions      limited number of           well-characterized RNA
                                                      well-characterized          binding domains
                                                      interactions

Variability, number of  Large number of zinc fingers  Limited number of protein   Limited number of
available elements      and other DNA-binding         dimerization domains        well-characterized RNA
                        domains is readily                                        binding domains
                        available, engineered zinc
                        finger domains

Interference with       May bind to chromatin;        Signal transduction domains May bind to endogenous RNA
cellular metabolism     selecting sequences that do   usually do not interfere in molecules; selecting
                        not affect growth             bacteria                    sequences that do not affect
                                                                                  growth


Niemeyer and coworkers [13] were the first to assemble enzymes on a DNA
scaffold in vitro. They arranged NADH:FMN oxidoreductase and luciferase on a
double-stranded DNA scaffold using the biotin-streptavidin linkage and showed
that the immediate spatial proximity of the enzymes enhances the coupled
activity. Later, they demonstrated an operational DNA scaffold using glucose
oxidase and horseradish peroxidase covalently linked to the DNA [14]. This system
was developed further by Wilner et al. [15], who used a supramolecular DNA
scaffold and linked glucose oxidase and horseradish peroxidase via a lysine
residue to DNA oligonucleotides that hybridized to the DNA nanostructures.

DNA scaffolds that rely on oligonucleotides conjugated to enzymes and assembled
into DNA nanostructures are impractical to use in vivo. Conrado and coworkers
[11] were the first to demonstrate a functional DNA scaffold in bacteria,
where the enzymes were attached to DNA-binding domains and scaffolded
onto the DNA program. The principle of the DNA scaffold has some advantages
in comparison with protein scaffolding (Figure 12.1d, Table 12.1). A DNA program
sequence requires no maturation, and an ordered set of nucleotide binding motifs
can be selected at will, which provides huge orthogonality. The docking, or
anchoring, of the DNA-binding protein to the DNA-target site is well
characterized. Due to the close proximity of the chimeric proteins bound to the
programmed nucleic acid sequence, other enzymes that might redirect synthesis are
spatially excluded from the multi-protein complex. Designed DNA-binding 
domains can be used as fusion partners of biosynthetic pathway enzymes. These 
domains can share the same type of protein fold; they have similar affinity to the 
scaffold, and, therefore, the binding of all of the components of the biosynthetic 
pathway proceeds under the same reaction conditions. 

RNA scaffolding is in many aspects similar to DNA scaffolding; however, only
a limited number of well-characterized RNA binding domains is available [7, 9]
(Chapter 13) (Figure 12.1c, Table 12.1). The advantage of the RNA scaffold over
the DNA scaffold is that it could be used in eukaryotes to organize a metabolic
pathway in the cytosol, whereas the DNA scaffold is probably limited to
prokaryotes.

Weighing these pros and cons, the DNA scaffold localizes primarily to the nucleus
in eukaryotic cells; for eukaryotes, therefore, the protein and RNA scaffolds are
the only choices. Moreover, the protein scaffold could be directed to
microlocations within the cell. The main advantages of the DNA scaffold are the
simple design of the DNA program, the well-characterized anchoring of the
DNA-binding proteins to the DNA-target site, and orthogonality; DNA scaffolds are
therefore recommended for use in bacteria over both RNA and protein scaffolds.


12.2 Biosynthetic Applications of DNA Scaffold 


DNA scaffold-assisted biosynthesis is a viable strategy for enhancing the meta- 
bolic product yield or production rate. This enhancement appears to arise from 
the proximity of metabolic enzymes bound to the DNA scaffold that increases 
the effective concentrations of the intermediary metabolites. In every tested 
case, the DNA scaffold-assisted biosynthesis implemented on existing meta- 
bolic pathways improved either the product yield or rate of product synthesis 
(Table 12.2). 


12.2.1 L-Threonine


Lee et al. [10] devised a DNA scaffold to facilitate the production of L-threonine
in Escherichia coli (Figure 12.2). The biosynthetic pathway, composed of the
homotetramer homoserine dehydrogenase (HDH), homotetramer threonine
synthase (TS), and homodimer homoserine kinase (HK), was assembled on
the DNA program using four-fingered zinc finger domains, named artificial
DNA-binding domains (ADBs), that bind to 12-bp DNA-target sequences. Metabolic
enzymes were linked to the N-terminal site of the ADBs, and several designs of
the DNA program were tested. Initially, the influence of 8-, 18-, and 28-bp
spacers between individual DNA-target sites on the L-threonine production rate
was analyzed. In addition, the impact of the number of target sites (from one to
four) for the third chimeric enzyme in the L-threonine metabolic pathway
(TS-ADB3) was evaluated. The DNA scaffold with an 8-bp spacer between the DNA-target sites,

Figure 12.2 The biosynthesis of L-threonine in E. coli is enhanced by the DNA scaffold.
(a) The three-step conversion of aspartate semialdehyde to L-threonine. (b) Arrangements of
DNA-target sites on the DNA program with indicated production rates for L-threonine are
depicted. The DNA scaffold includes the chimeric proteins, homoserine dehydrogenase (HDH;
E1), homoserine kinase (HK; E2), and threonine synthase (TS; E3), fused to DNA-binding
domains (ADB). Consecutive arrangements of DNA-target sites for threonine synthase (E3), the
third enzyme in the biosynthesis of L-threonine, improved the production rate for L-threonine.
The DNA-target sites specific for the individual chimeric proteins are separated with 8-, 18-, or
28-bp spacers between each DNA-target site. The fastest rate of L-threonine production in
E. coli was obtained with the DNA program [1:1:2], with DNA-binding sites separated by 8 bp
(see also [10]).


and with two copies of the TS ([1:1:2] 8 bp), reduced the production time for
L-threonine by more than 50%, with the maximum yield produced within 24 h of
fermentation. For the strain without the DNA scaffold, it took 2 days to produce
the same maximum yield of L-threonine. In addition, the concentration of the 
intermediate homoserine, which might inhibit the growth of the host cell, was 
reduced 15-fold. 


12.2.2 trans-Resveratrol 


We examined the ability to assemble trans-resveratrol
(trans-3,5,4'-trihydroxystilbene) biosynthetic enzymes on DNA in the cytoplasm of
E. coli using zinc finger DNA-binding domains, each recognizing a 9-bp-long
nucleotide sequence, as DNA-binding proteins [11]. The metabolic pathway for
resveratrol has already been reconstituted in yeast, mammalian cells, and
bacteria [7, 12, 16]. The production of trans-resveratrol from 4-coumaric acid is
a two-step reaction. First, 4-coumaric acid is converted to 4-coumaroyl-CoA by
4-coumarate-CoA ligase (4CL). trans-Resveratrol is then formed by the
condensation of one molecule of 4-coumaroyl-CoA and three molecules of
malonyl-CoA by stilbene synthase (STS) (Figure 12.3). We used a low copy number
expression plasmid with genes encoding 4CL and STS, which were fused to the
C-terminus of Zif268 and PBSII zinc finger domains, respectively. The DNA scaffold
was present on separate high copy number plasmids. Different DNA programs with
various spacer lengths (2, 4, and 8 bp) and numbers of program repeats (4 and 16)
(Table 12.2) were tested, and almost 10 mg/l of trans-resveratrol was produced
when the number of scaffold repeats was 4 and the spacer length between the
DNA-target sites was 2 bp, which is 10 times more than with the fusion protein of
4CL and STS (Figure 12.3b) [11].

Figure 12.3 A DNA scaffold enhances the biosynthesis of trans-resveratrol in E. coli. (a) In the
biosynthetic pathway of resveratrol, 4-coumaric acid is converted to resveratrol in a
two-step reaction with the biosynthetic enzymes 4-coumarate-CoA ligase (4CL) and stilbene
synthase (STS). (b) Close proximity of the 4CL and STS enzymes can be achieved by fusing the
enzymes with linker polypeptides or by introducing DNA scaffolds where the enzymes
(4CL or STS) are fused to the DNA-binding domains (Zif268 or PBSII). The chimeric protein of
the enzyme and DNA-binding domain binds to a specific nucleotide sequence present on the
DNA program. The DNA-target sites specific for the individual chimeric proteins are separated
with a 2-bp spacer between each of four tandem repeats [11].


12.2.3 1,2-Propanediol


A biosynthetic pathway for 1,2-propanediol composed of methylglyoxal synthase
(MgsA), 2,5-diketo-D-gluconic acid reductase (DkgA), and glycerol dehydrogenase
(GldA) in E. coli is well established [17] (Figure 12.4a). The biosynthetic
enzymes were fused to the N-terminus of the zinc finger domains ZFa, ZFb, and
ZFc, each recognizing a 9-bp target, and the corresponding chimeras were placed on
the same plasmid as the target DNA sequence [11]. Several enzyme-scaffold
ratios were tested (Figure 12.4c,d), and the E. coli with the [1:1:1], 12-bp spacer
1,2-propanediol system produced almost five times more product than the
unscaffolded control (Table 12.2).


12.2.4 Mevalonate 


The biosynthesis of mevalonate is a three-step reaction composed of acetoa- 
cetyl-CoA thiolase (AtoB), hydroxymethylglutaryl-CoA synthase (HMGS), and 


[Figure 12.4 graphic. (a) DHAP -> methylglyoxal -> acetol -> 1,2-propanediol, catalyzed by
MgsA (E1), DkgA (E2), and GldA (E3). (b) Acetyl-CoA -> acetoacetyl-CoA -> HMG-CoA ->
mevalonate, catalyzed by AtoB, HMGS, and HMGR. (c) Scaffold programs [1:1:1], [1:2:1],
[1:2:2], and [1:4:2] with n = 1, 2, 4, 8, 16 and 4- and 12-bp spacers, in consecutive,
bidirectional, and combined arrangements. (d) Ranking of scaffold architectures by
1,2-propanediol and mevalonate production.]



Figure 12.4 Biosynthesis of (a) 1,2-propanediol and (b) mevalonate in E. coli. (c) Schemes of 
consecutive, bidirectional, and mixed consecutive and bidirectional arrangements of DNA 
scaffolds with different stoichiometry and positions of DNA-target sites that were tested for the 
improved biosynthesis of 1,2-propanediol or mevalonate. DNA scaffolds can be used to overcome 
the limitations in biosynthetic pathways that occur because of individual enzymes with lower 
activity, compared with other enzymes in the same biosynthetic pathway. By changing the order 
or number of DNA-target sites, we can increase reaction yields, fine-tune biosynthesis production, 
and minimize side products. If the first enzyme in the biosynthetic pathway is most active, others 
can be distributed on both sides around the first, resulting in 1:2 molar ratios in favor of enzymes 
with low activity. Such groups of enzyme binding sites can then be multiplied on the DNA 
scaffold to achieve better molar ratios between the DNA scaffold and enzymes. (d) Impact of 
different scaffold architectures on 1,2-propanediol and mevalonate production [11]. 


hydroxymethylglutaryl-CoA reductase (HMGR). The biosynthesis of meva- 
lonate in E. coli, as such or assisted by a protein scaffold, has already been pub- 
lished [8, 18]. The chimeric proteins between the enzymes of the mevalonate 
pathway and zinc finger domains were constructed [11] (Figure 12.4b). For the 
DNA scaffold design, the DNA-target sequences corresponding to each of the 
DNA-binding domains were placed on a separate plasmid, and the influence of 
the DNA-target sites arrangements on mevalonate production was tested. 
Similar to the resveratrol and 1,2-propanediol scaffolds, the mevalonate yield
was increased up to threefold in the presence of [1:2:2]n scaffolds (n = 2, 4, and
16) with 12-bp spacers, compared with the random scaffold control; however,
the best mevalonate yield was achieved with the DNA scaffold containing the
[1:4:2]16 program (Figure 12.4c,d).


12.3 Design of DNA-Binding Proteins and Target Sites 


A self-replicating DNA plasmid in one or more copies is an ideal scaffold for
any information processing; for example, the DNA sequence represents a
program consisting of a series of blocks (DNA-target sites), which determine the
arrangement of the DNA-binding proteins along the DNA (Figure 12.1d). The
main twist comes with the requirement that each of these DNA-binding
proteins/domains is fused to a different functional protein. Therefore, the sequence
of target motifs encoded by the DNA program also defines the arrangement of
those functional proteins along with the order of the DNA-binding domains.
Simply by changing the sequence of a DNA program, either by switching positions
or by adding new target sequences, the arrangement can be altered and its outcome
predicted in advance (Figure 12.1d). This requires a method for the
site-specific targeting of enzymes along
the DNA surface. While there are 64 nucleotide triplets in the natural code for 
the 20 amino acids, there could be as many as 262,144 different motifs consist- 
ing of nine nucleotides. Zinc fingers [19] and TAL elements [20] can be designed 
to bind to almost any desired nucleotide sequence, ranging from 9 to as many 
as 18 nucleotides. Additionally, we can select the target nucleotide sequence for 
each available DNA-binding protein. 
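
The motif counts quoted above follow from simple combinatorics: a site of n nucleotides drawn from the four bases A, C, G, and T has 4^n possible sequences. A minimal sketch of that arithmetic is given below; the function name `motif_count` is chosen here for illustration only and does not come from any cited tool.

```python
def motif_count(length_nt: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of the given length over A, C, G, T."""
    if length_nt < 0:
        raise ValueError("length_nt must be non-negative")
    return alphabet_size ** length_nt


# Triplets (as in the genetic code), a three-finger zinc finger site (9 nt),
# a four-finger site (12 nt), and an 18-nt site.
for n in (3, 9, 12, 18):
    print(f"{n:>2} nt: {motif_count(n):,} possible motifs")
```

For n = 3 and n = 9 this reproduces the 64 triplets and 262,144 nine-nucleotide motifs mentioned in the text.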


12.3.1 Zinc Finger Domains 


There are more than 700 experimentally characterized zinc fingers in the data- 
base ZIFDB [21], offering a huge selection of building elements for synthetic 
biology [22, 23]. Moreover, zinc fingers have similar properties, such as binding
affinity or stability, which is important because the properties of each separate
part then do not need to be adjusted individually. The DNA program, therefore,
represents a modular approach for various synthetic biology applications.

Up until now, only zinc finger DNA-binding domains have been used to link
biosynthetic proteins to DNA scaffolds. Conrado et al. used five different zinc
finger domains (PBSII, Zif268, ZFa, ZFb, and ZFc), each composed of
three fingers, with specificity for unique 9-bp DNA sequences [11, 24-26]
(Table 12.2). Statistically, a 9-bp-long sequence could appear 1.2 times per 
genome in E. coli, if we assume that the nucleotide sequence distribution within 
the genome is random. Lee et al. [10] used ADBs with four fingers that recog- 
nized a 12-bp DNA sequence. All of the zinc fingers used were relatively short 
and bound the DNA with low nanomolar affinity. Crucially, the selected zinc 
finger domains should not bind functional regions of essential genes in E. coli or 
affect bacterial fitness. 
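
How often a target site is expected to occur by chance can be estimated with a short calculation. The sketch below assumes a uniformly random genome with equal base frequencies; the genome length of 4.6 Mbp for E. coli and the option to count both strands are assumptions introduced here, not values taken from the text, and palindromic sites are not treated specially. The result depends strongly on these inputs, and the sketch does not attempt to reproduce the value quoted above.

```python
def expected_occurrences(site_len: int, genome_len: int,
                         both_strands: bool = True) -> float:
    """Expected chance occurrences of one specific site in a random genome."""
    if site_len < 1 or genome_len < 1:
        raise ValueError("site_len and genome_len must be positive")
    # Number of windows of length site_len along one strand.
    positions = max(genome_len - site_len + 1, 0)
    strands = 2 if both_strands else 1
    # Each window matches a given site with probability 1 / 4**site_len.
    return strands * positions / 4 ** site_len


ECOLI_GENOME_BP = 4_600_000  # assumed approximate genome size

for site_len in (9, 12, 18):
    per_genome = expected_occurrences(site_len, ECOLI_GENOME_BP)
    print(f"{site_len}-bp site: about {per_genome:.3g} expected occurrences")
```

Under these assumptions, longer recognition sites, such as the 12-bp sites of four-finger domains, are expected far less often by chance than 9-bp sites, which is one reason to check candidate target sequences against the host genome before use.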

As an in vitro test of the system components, binding to DNA can be analyzed
using surface plasmon resonance (SPR) [27] (Figure 12.5a) or split GFP
technology [28]. For the latter, the candidate zinc finger domains are fused
to the two halves of a split fluorescent protein. Reassembly of the split yellow
fluorescent protein (YFP), indicated by strong fluorescence, occurs only in the
presence of a DNA scaffold that contains neighboring binding sites, for
example, for PBSII and Zif268, separated by only 2 bp (Figure 12.5b) [11].

To investigate whether zinc finger domains bind their cognate DNA targets
in vivo, a simple β-galactosidase test for DNA-binding domain activity in E. coli
was used (Figure 12.5c). The principle of this test is that an active zinc finger
domain should bind to its specific target sequence in the Psyn promoter and act
as a synthetic repressor, thereby decreasing the basal activity of this promoter
and lowering β-galactosidase levels.

Figure 12.5 Targeting DNA in vitro and in vivo with zinc finger domains. (a) The binding
affinity of zinc finger domains (e.g., Zif268) to their specific nucleotide target sequence was
determined using surface plasmon resonance (SPR). With increasing concentrations of
purified zinc finger protein, the response signal increases, indicating protein binding.
(b) The zinc finger domain (PBSII) was fused to the N-terminal half (PBSII-nYFP), and Zif268
was fused to the C-terminal half (cYFP-Zif268) of the yellow fluorescent protein (YFP).
Purified PBSII-nYFP and cYFP-Zif268 protein chimeras were mixed either with DNA scaffolds
containing PBSII and Zif268 target sites separated by a 2-bp spacer or with DNA scaffolds
with random nucleotide sequences. Fluorescence was then measured [11]. (c) The binding of
the DNA-binding domain (e.g., Zif268) in vivo was tested by the inhibition of β-galactosidase
expression. The expression of the tested zinc finger was under the control of an
arabinose-inducible promoter. The lacZ gene was controlled by the Psyn promoter, which
contained either the zinc finger target site or a random DNA target site (CTCTATCAATGATAGAG).
β-Galactosidase activity is measured in the presence (1%) or absence (0%) of arabinose and
normalized to the β-galactosidase levels of the unrepressed state. β-Galactosidase activity is
detected when the DNA-binding protein (e.g., zinc finger A) is not expressed (no arabinose).
Arabinose induces the expression of zinc finger A, which binds to the DNA-target site “a”
upstream of the β-galactosidase gene and represses the expression of β-galactosidase.
If the DNA-target site “b” is not recognized by the DNA-binding protein, the expression of
β-galactosidase is not affected.
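
The normalization used in this reporter test can be written down in a few lines. The following sketch expresses the reporter activity measured with arabinose (zinc finger expressed) as a fraction of the activity without arabinose (unrepressed state). All numerical readings below are hypothetical and serve only to illustrate the calculation; they are not measurements from the cited work.

```python
def relative_activity(induced: float, uninduced: float) -> float:
    """Reporter activity with the zinc finger expressed, as a fraction of
    the unrepressed (no arabinose) level."""
    if uninduced <= 0:
        raise ValueError("uninduced activity must be positive")
    return induced / uninduced


# Hypothetical readings in arbitrary units, for illustration only.
readings = {
    "cognate target site": {"uninduced": 1000.0, "induced": 200.0},
    "random control site": {"uninduced": 1000.0, "induced": 950.0},
}

for construct, r in readings.items():
    fraction = relative_activity(r["induced"], r["uninduced"])
    print(f"{construct}: {fraction:.0%} of unrepressed activity")
```

A fraction well below 1 for the cognate site, together with a fraction near 1 for the control site, is the pattern expected when the zinc finger binds its target specifically.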


Taken together, the described results indicate that (i) zinc fingers retained 
DNA-binding activity when fused to different proteins and (ii) two orthogonal 
zinc finger domains can simultaneously bind their target sequences in a DNA 
scaffold and bring their fused protein domains into close proximity as evidenced 
by the YFP reassembly. 


12.3.2 TAL-DNA Binding Domains 


The recent discovery of the code underlying the nucleotide sequence recognition 
by TAL effectors allows the design of protein domains that can bind to almost 
any nucleotide sequence [20] (Chapter 13). Similar to the zinc finger proteins, 
the TAL protein domains also seem to be ideal DNA-binding proteins for use in 
DNA scaffold applications. The typical TAL recognition site of 15-20 nucleotides
is more than sufficient to provide the specificity required to build DNA
scaffolds, even when taking into account any cross-interaction with similar
DNA-binding sequences in the host genome [29]. The binding affinities of
different TAL DNA-binding domains are similar and in the nanomolar range, as
for zinc finger proteins. Due to the practically unlimited number of different
combinations, there is no concern about running out of DNA-binding sites,
regardless of the number of desired scaffolded enzymes.


12.3.3 Other DNA-Binding Proteins 


In theory, practically any DNA-binding protein could be used in DNA scaffold 
applications. With the exception of zinc fingers and TALs, many of the charac- 
terized DNA-binding proteins bind DNA as dimers or tetramers (TetR, CI, and 
others), which would complicate the construction of DNA scaffold molecules if 
one desires to bind enzymes in a predefined molar ratio. Nevertheless, they may 
be useful for applications involving oligomeric enzymes. 


12.4 DNA Program 


The DNA scaffold is, in principle, more flexible in scaffold design than protein
or ssRNA scaffolds. Since dsDNA completes a helical turn approximately every
10 bp, we can use this property to guide the relative orientation of the enzymes
coupled to the DNA-binding domains. We can change (i) the spatial orientation of
the bound enzymes, by changing the spacer length between the DNA-target sites;
(ii) the number of DNA scaffold repeats, allowing us to additionally tune the
biosynthetic pathway; and last but not least (iii) the number of target sites for
each enzyme on the DNA scaffold, which enables us to modify the enzymatic
stoichiometry.
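
The three design parameters listed above (spacer length, number of repeats, and stoichiometry) can be captured in a small helper that assembles a DNA program as a string. This is a minimal sketch for a consecutive arrangement only. The function name, the placeholder target sequences, and the use of `N` as a spacer base are assumptions made here for illustration; they are not the sequences or software used in the cited studies.

```python
from typing import Sequence


def build_dna_program(target_sites: Sequence[str], ratio: Sequence[int],
                      spacer_len: int, repeats: int,
                      spacer_base: str = "N") -> str:
    """Assemble a consecutive DNA program such as [1:2:1]n.

    target_sites: one DNA-target sequence per enzyme, in pathway order.
    ratio:        copies of each target site per unit, e.g. (1, 2, 1).
    spacer_len:   unoccupied nucleotides between neighboring target sites.
    repeats:      number of times the unit is repeated (n).
    """
    if len(target_sites) != len(ratio):
        raise ValueError("target_sites and ratio must have the same length")
    if spacer_len < 0 or repeats < 1:
        raise ValueError("spacer_len must be >= 0 and repeats >= 1")
    spacer = spacer_base * spacer_len
    # One unit lists every site as many times as its ratio entry demands.
    unit = [site for site, copies in zip(target_sites, ratio)
            for _ in range(copies)]
    return spacer.join(unit * repeats)


# Hypothetical 9-bp placeholder sites, not real zinc finger targets.
SITE_1 = "AAACCCGGG"
SITE_2 = "TTTGGGCCC"

program = build_dna_program([SITE_1, SITE_2], ratio=(1, 1),
                            spacer_len=2, repeats=4)
print(program)
print(len(program), "nt")
```

In this representation, changing the order of `target_sites`, the entries of `ratio`, or `spacer_len` corresponds directly to the modifications (i) to (iii) described in the text. Bidirectional arrangements would need a different unit layout and are not covered by this sketch.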


12.4.1 Spacers between DNA-Target Sites 


The program DNA is designed to organize biosynthetic pathway enzymes into a 
functional complex. Spacers in the DNA sequence separating the DNA-target 
sites on the program DNA determine the spatial orientation of chimeric bio- 
synthetic enzymes relative to each other (Figure 12.6). The binding sites for 
three-fingered zinc fingers span nine nucleotides but can be extended to 18-bp 
recognition motifs for longer zinc fingers, spanning from one to two DNA duplex 
helical turns, respectively. The binding sites for DNA-binding proteins are sepa- 
rated by spacers, which are nucleotides that are not occupied by DNA-binding 
proteins. The length of the spacer sequence is not coincidental, and the selection 
follows the three-dimensional structure of a DNA molecule. One turn of the
DNA helix is 10.5 bp long, which roughly matches the length of DNA
encircled by one zinc finger domain recognizing and binding to 9 bp. In
order to have functional units on the same side of a DNA molecule serving as a
DNA program, it is very important to select the right spacer length. The
length of the spacer between the DNA-target sites determines on which side of
the double helix the functional domain will be attached: a spacer of one or two
nucleotides positions the domains very close together, while a

[Figure 12.6 graphic: side and front views of the scaffolds; bar chart of
trans-resveratrol (mg/l) for the [1]4-850 bp-[1]4 and [1:1]4 2-bp scaffolds.]




Figure 12.6 Spatial position of biosynthetic enzymes is defined by DNA program.
(a) Scheme of two types of DNA scaffolds that differ in the spacer lengths separating the
DNA-target sites. A first scaffold plasmid (left) carries four copies of Zif268 and four copies of
PBSII binding sites, separated by an insertion of 850 bp along part of the plasmid backbone.
A second scaffold plasmid carries four copies of the Zif268 and PBSII binding sites separated
by 2-bp-long spacers. (b) Enzyme clustering improves the production of trans-resveratrol,
which was measured in Escherichia coli expressing fusion enzymes (Zif268-4CL and
PBSII-STS) with different DNA program plasmids, [1]4-850 bp-[1]4 and [1:1]4 with 2-bp spacers
(for details see Figure 12.4a) [11]. (c) The spatial orientation of the enzymes is governed by
the spacer between the DNA-target sites. The 2- and 8-bp spacers orient chimeric enzymes
on the same side of the DNA program (top). The 4-bp spacer between the target sites
orients the enzymes on opposite sides of the DNA program (bottom). The best production
of trans-resveratrol in E. coli was achieved when the binding sites for Zif268 and PBSII were
separated with 8 bp [11].


spacer of four to five nucleotides positions the neighboring two functional 
domains to the opposite sides of the helix (Figure 12.6c). 
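
The geometric argument can be made concrete with an idealized model of B-form DNA. The sketch below converts an offset along the helix, in base pairs, into a rotation angle around the helix axis, assuming 10.5 bp per turn as stated above. The function names and the 90-degree tolerance for "same side" are assumptions made here for illustration. How a given spacer length translates into the effective offset between two attachment points also depends on the footprint and geometry of the bound proteins, which this model ignores.

```python
def helical_phase_deg(offset_bp: float, bp_per_turn: float = 10.5) -> float:
    """Rotation around the helix axis (0 to <360 degrees) for an offset in bp."""
    if bp_per_turn <= 0:
        raise ValueError("bp_per_turn must be positive")
    return (offset_bp % bp_per_turn) / bp_per_turn * 360.0


def same_side(offset_bp: float, bp_per_turn: float = 10.5,
              tolerance_deg: float = 90.0) -> bool:
    """True if the rotation is within tolerance_deg of a whole number of turns."""
    phase = helical_phase_deg(offset_bp, bp_per_turn)
    deviation = min(phase, 360.0 - phase)
    return deviation < tolerance_deg


for offset in (1, 2, 4, 5, 10.5, 21):
    side = "same side" if same_side(offset) else "opposite side"
    print(f"offset {offset:>4} bp: {helical_phase_deg(offset):6.1f} degrees, {side}")
```

In this simplified picture, an offset of about 5 bp corresponds to roughly half a turn, which is consistent with the statement that spacers of four to five nucleotides place neighboring domains on opposite sides of the helix.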

Initially, the impact of the clustering of metabolic pathway enzymes [11] was
examined, and the DNA-target sites within the [1:1]4 scaffold were separated on
the plasmid by either 2 or 850 bp (Figure 12.6a). The [1]4-850 bp-[1]4 scaffold
provided the same number of binding sites on the plasmid for both enzymes but
prevented the close proximity of the bound enzymes to one another. The fivefold
enhancement in resveratrol production observed for the [1:1]4 scaffold was
abolished when the binding sites for each enzyme were positioned far apart on
the plasmid, indicating that close proximity of the pathway enzymes is important
(Figure 12.6b).

In the example of resveratrol biosynthesis, we [11] examined whether the
three-dimensional positioning of individual enzymes affects production yields
(Figure 12.6c). DNA scaffolds with DNA-target sites separated by 2, 4, or 8 bp were
constructed. Considering the standard DNA topology, the 2- and 8-bp spacers
position the functional units on the same side of the DNA scaffold, and the 4-bp
spacer positions the functional units on opposite sides of the DNA program. In the
case of the [1:1]16 resveratrol system, the best product yields were obtained with
the DNA program in which individual DNA-target sites were separated by
spacer lengths of 2 or 8 bp, while a spacer length of 4 bp (where the enzymes are
oriented in opposite directions from the DNA duplex) showed a smaller yet
measurable improvement over the free soluble enzymes. The impact of 4- and
12-bp-long spacers between the DNA-target sites for the 1,2-propanediol and
the mevalonate DNA scaffolds was also analyzed [11]. All scaffolds with 4-bp
spacers between zinc finger binding sites were less effective than their 12-bp
counterparts.

Lee et al. [10] constructed scaffold plasmids to position ADB-enzyme fusions 
every 20 bp (8 bp spacer), 30 bp (18 bp spacer), and 40 bp (28 bp spacer), so that all 
scaffold-bound enzymes were on the same side of the DNA program in three- 
dimensional space. The scaffold with the 8-bp spacer sequence was associated 
with the most efficient L-threonine production, confirming the finding that close 
proximity of the metabolic enzymes enhances the product synthesis, most likely 
through substrate channeling. 
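
The geometric argument behind spacer selection can be sketched numerically. The following Python sketch is only an illustration under stated assumptions: it treats DNA as an ideal helix with a fixed number of base pairs per turn (about 10.5 for B-DNA, or an idealized 10), it uses the center-to-center distance between neighboring sites (site length plus spacer), and it ignores linker flexibility, protein size, and supercoiling. The function names and the 90-degree tolerance are hypothetical choices, not part of the cited studies.

```python
def helical_phase_deg(center_to_center_bp: float, bp_per_turn: float = 10.5) -> float:
    """Rotation around the helix axis between two sites, in degrees within [-180, 180).

    center_to_center_bp: distance between equivalent positions of two
        neighboring binding sites (site length + spacer), in base pairs.
    bp_per_turn: helical repeat; 10.5 is the usual B-DNA value, 10.0 an idealization.
    """
    angle = (center_to_center_bp / bp_per_turn) * 360.0
    # Wrap the angle so that 0 means "same face" and +/-180 means "opposite face".
    return (angle + 180.0) % 360.0 - 180.0


def same_side(center_to_center_bp: float, bp_per_turn: float = 10.5,
              tolerance_deg: float = 90.0) -> bool:
    """True if the two sites face roughly the same side of the helix.

    tolerance_deg is an arbitrary cutoff chosen for this sketch.
    """
    return abs(helical_phase_deg(center_to_center_bp, bp_per_turn)) < tolerance_deg


if __name__ == "__main__":
    # Periodicities reported for the L-threonine scaffolds: one enzyme every 20, 30, or 40 bp.
    for period in (20, 30, 40):
        print(period, round(helical_phase_deg(period), 1), same_side(period))
```

With an idealized repeat of 10 bp per turn, periodicities of 20, 30, and 40 bp correspond to whole turns, which is the sense in which all bound enzymes lie on the same side; with 10.5 bp per turn the phase drifts as the spacing grows, so the sketch should be read as a first approximation only.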

As demonstrated with trans-resveratrol and the other biosynthetic
pathways, the spatial orientation and clustering of the enzymes on the DNA
scaffold are important. Because the structure of DNA is predictable, the enzyme
orientation in situ can be predicted, which simplifies the design of the DNA
scaffold. This is particularly important for larger enzymes, which, due to steric
effects, might prevent the binding of other enzymes to the DNA scaffold.


12.4.2 Number of DNA Scaffold Repeats


In addition to the length of a spacer, the number of repeats of the DNA scaffold 
is important for fine-tuning the biosynthetic metabolic pathway. 

Conrado et al. examined in detail the effect of increasing the number of scaffold
repeats. They constructed scaffolds with enzyme-scaffold ratios in the range of
40:1 to 1:3 (e.g., [1:1:1]1 to [1:1:1]16). A DNA program with DNA-target
sequences for each of the three enzymes of the 1,2-propanediol pathway
was placed on the same plasmid as the zinc finger chimeras. The best 1,2-propanediol
yield was obtained when the number of scaffold repeats was 4, regardless of the
arrangement of the DNA-target sites (see Section 12.4.3), with 12-bp spacers
between the binding sites. For DNA scaffolds with 4-bp spacers between the
DNA-target sites, the number of repeats played no role [11].

For mevalonate production, which also relies on a three-enzyme metabolic pathway,
the genes encoding the chimeric biosynthetic enzymes were not on the
same plasmid as the DNA program, enabling alterations not only in
the number of scaffold repeats but also in the copy number of the plasmids
carrying the DNA scaffold. The largest yield enhancement came from the 16
repeats of the [1:4:2] scaffold. This was followed closely by several of the scaf- 
folds [1:2:2] with 2, 4, or 16 repeats (Figure 12.4). In agreement with the previ- 
ous results for 1,2-propanediol and mevalonate, a yield enhancement for the 
trans-resveratrol was observed when the number of scaffold repeats was 
decreased from 16 to 4. 

These improvements highlight the ability to impact biosynthesis via simple 
changes in scaffold design. The number of scaffold repeats can easily be 
changed, not only by changing the DNA program repeats on plasmid DNA but
also by changing the number of plasmids in a cell. This can be achieved by 
introducing different origins of replication, from low to high copy number 
properties. The biosynthesis of a metabolite represents an additional burden for 
the cell. If the burden is too high for the cell, the production of a metabolite will 
not lead to the maximal yield. By changing the number of scaffold repeats, we 
can determine the state where the production yield of the wanted biosynthetic 
product is maximal. 


12.4.3 DNA-Target Site Arrangement 


In addition to the length of a spacer between DNA-target sites and the number 
of DNA scaffold repeats, the stoichiometry of DNA-target sites for individual 
enzymes forming biosynthetic pathways could also be varied. This is beneficial 
for biosynthetic pathways with enzymes with different kinetics. 

It should be noted that different enzyme arrangements on plasmid DNA are
possible. Different architectures are described as, for example, [E1a:E2b:E3c]n for
a three-enzyme scaffold, where a, b, and c describe the enzyme stoichiometry
within a single scaffold unit and n is the number of times the scaffold unit is
repeated on the plasmid (Figure 12.7a).
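
To make the bookkeeping of this notation concrete, here is a minimal Python sketch that parses an architecture string and counts the binding sites per enzyme. It assumes a flattened plain-text spelling in which the repeat number directly follows the closing bracket (for example "[1:2:2]4") and defaults to one repeat when none is given; the class and function names are hypothetical and serve only as illustration.

```python
import re
from dataclasses import dataclass

# "[a:b:c]n" with an optional repeat count n (assumed plain-text form of the subscript).
_PATTERN = re.compile(r"^\[(\d+(?::\d+)*)\](\d+)?$")


@dataclass(frozen=True)
class Scaffold:
    stoichiometry: tuple  # binding sites per enzyme within one scaffold unit
    repeats: int          # number of times the unit is repeated on the plasmid

    @property
    def sites_per_enzyme(self) -> tuple:
        """Total binding sites available to each enzyme on one plasmid."""
        return tuple(count * self.repeats for count in self.stoichiometry)

    @property
    def total_sites(self) -> int:
        """Total number of binding sites on one plasmid."""
        return sum(self.sites_per_enzyme)


def parse_scaffold(text: str) -> Scaffold:
    """Parse strings such as '[1:2:2]4' or '[1:1]16' into a Scaffold."""
    match = _PATTERN.match(text.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a scaffold architecture: {text!r}")
    stoichiometry = tuple(int(part) for part in match.group(1).split(":"))
    repeats = int(match.group(2)) if match.group(2) else 1
    return Scaffold(stoichiometry, repeats)


if __name__ == "__main__":
    scaffold = parse_scaffold("[1:2:2]4")
    print(scaffold.sites_per_enzyme, scaffold.total_sites)
```

For "[1:2:2]4" the sketch reports 4, 8, and 8 sites for the three enzymes, 20 sites in total per plasmid; the number of sites per cell would additionally scale with the plasmid copy number.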

For the L-threonine scaffold, Lee et al. [10] used the following architectures 
[1:1:1], [1:1:2], [1:1:3], [1:1:4] with an increasing number of homodimer TSs 


[Figure 12.7 graphic: panels labeled "Biosynthetic flux channeling" and "Information processing (e.g., cascade of protein kinases)".]


Figure 12.7 Applications of DNA-guided programming. (a) For many biosynthetic pathways, 
the first enzymes in the cascade are the same, and the end product is determined by the 
enzymes that are lower in the cascade. By immobilization of specific enzymes on a DNA 
scaffold, when others are left out, we can determine which end product will be preferentially 
synthesized. This is a powerful tool for influencing the biosynthetic flux to produce less 
unwanted products and a cleaner end product. (b) Similar to protein scaffolds, DNA scaffolds 
can be used for defining the order in which protein kinases phosphorylate each other or a 
chain of other posttranslational protein modifications. Signaling pathways can, therefore, be 
modulated using different scaffolds. The DNA scaffold could also be used for information 
processing such as rewiring intracellular signaling pathways and designing new protein 
networks for constructing new biological devices with selected features. 
in a consecutive manner. The arrangement [1:1:2] with an 8-bp spacer produced
the best results, followed closely by [1:1:3] and [1:1:4]. They found that the
production rate was threefold higher than that of [1:1:1] (Figure 12.2a).

For 1,2-propanediol and mevalonate synthesis, the scaffolds were designed
bidirectionally, such that the first enzyme was flanked on both sides by the
second enzyme, followed by the third enzyme ([1:2:2]). In addition, consecutive
arrangements of the second enzyme (DkgA for 1,2-propanediol and HMGS for
mevalonate biosynthesis), [1:2:1] and [1:4:1], were tested.
The [1:2:1] arrangement with 12-bp spacers gave the best yield of
1,2-propanediol, closely followed by [1:2:2] and [1:4:2] with 12-bp spacers
(Figure 12.4). The DNA scaffold [1:4:2] combines both the bidirectional and
consecutive arrangements of DNA-target sites and functional units. For mevalonate
production, the [1:4:2]2 scaffold with 12-bp spacers gave the best yield, followed
closely by the [1:2:2] scaffolds with 2, 4, or 16 repeats and 12-bp spacers [11].

In some biosynthetic pathways, the bottleneck is the conversion rate of a sub- 
strate into a product, which can be a substrate for the next enzyme in the meta- 
bolic pathway. By changing the arrangement of biosynthetic enzymes on a DNA 
scaffold, the imbalances in the enzyme kinetics can be overcome. It might be 
expected that the multimerization of functional enzymes could interfere with 
the formation of functional scaffolds; however, the biosynthesis of L-threonine 
depends on enzymes that are active as homotetramers and homodimers, and 
still, the DNA scaffold improves the production rate of L-threonine [10]. Multimeric
proteins might even facilitate DNA scaffold cross-linking, thereby building regions
with locally elevated concentrations of metabolites (metabolite microdomains) that
are dedicated to bioconversion. Moon and coworkers [30]
showed a positive correlation between a glucaric acid titer and the number of 
scaffold interaction domains targeting upstream myo-inositol-1-phosphate syn- 
thase. In the mevalonate pathway, protein scaffolding generating microdomains 
enabled faster growth rates, likely minimizing the cellular accumulation of the 
toxic intermediate HMG-CoA in E. coli [8]. 

Taken together, the DNA scaffold is a useful tool to improve biosynthesis. The 
predictable nature of DNA enables the fine-tuning of metabolic biosynthesis and 
production yields. 


12.5 Applications of DNA-Guided Programming 


By studying different DNA scaffold architectures, enzyme stoichiometry, and 
flux balanced or imbalanced biosynthetic pathways, it should be possible to 
determine when the enzyme co-localization is most beneficial. This, in turn, will 
be very useful for guiding the future design of these systems and in envisioning 
new applications for enzyme co-localization. It is also worth mentioning that the 
DNA scaffold approach is highly complementary to many of the existing meth- 
ods for enzyme, pathway, and strain engineering that are already in the cellular 
engineer’s toolkit. Therefore, a successful strategy for achieving production
yields near the theoretical maximum, which is necessary for the commercial viability
of production processes, will likely involve a combination of these approaches.


Many biosynthetic pathways are also branched, which means that the enzymes
at the initial steps are shared and the enzymes after the branch point differ, which
determines the end products and their ratios. With DNA scaffolds in which the
order of enzymes can be changed, the end product could be determined by
scaffolding the selected pathway. This leads to cleaner end products and fewer
unwanted products (Figure 12.7a). In addition, with the substrate-to-product
channeling achieved by the DNA scaffold, intermediate products that are toxic
for the cell or that can significantly slow down the production rate are consumed
faster and accumulate less.

Moreover, the DNA-guided assembly could also be used outside the cell to 
support biosynthetic reactions in vitro, comprising (i) functional units, for exam- 
ple, biosynthetic enzymes linked to DNA-binding domains or linked to single- 
stranded oligonucleotides by chemical modification [13-15]; (ii) a DNA scaffold 
comprising one or more target site sequences; and (iii) a substrate for the first 
enzyme and cofactors for the enzymes provided to the mixture. Erkelenz et al. 
[31] generated a hybrid DNA-protein device based on the two cytochrome P450 
BM3 subdomains conjugated to oligonucleotides. The two conjugates arranged 
on a switchable DNA scaffold form active monooxygenase, which could be 
turned off by DNA strand displacement. 

DNA scaffolds could also be used to control the flow of different classes of 
biological information mediators that extend beyond the metabolic pathways 
and small molecule products. For example, DNA scaffolds could be used to 
rewire intracellular signaling pathways or to coordinate other assembly-line pro- 
cesses, such as protein folding, degradation, and posttranslational modifications 
(Figure 12.7a,b). Thus, we anticipate that DNA scaffolds should enable the con- 
struction of reliable protein networks to program a range of cellular events. Even 
though the beauty of nature’s most elegant compartmentalization strategies, 
such as a protected tunnel [32] or intracellular organelles [33, 34], has yet to be 
recapitulated by engineers, the use of DNA scaffolds is an important early step 
toward this goal. 

Strain development is still hampered by the intrinsic inefficiency of enzymatic 
reactions caused by simple diffusion and the random collision of enzymes and 
metabolites. Scaffolding strategies that promote the proximity of metabolic
enzymes and direct metabolic intermediates through the catalytic assembly
steps are promising solutions to this problem [7-12]. Regardless of scaffold
type, the enzyme assembly increases the local concentration of intermediates
around the enzymes on the scaffold, preventing the loss of intermediates to
competing reactions and overcoming the problem of toxic intermediates through
the rapid conversion of inhibitors.


Definitions 


The DNA scaffold is a DNA molecule that serves as a platform for the spatial 
organization of DNA-binding protein domains. The sequential order of the 
DNA-binding protein domains with their fusion partners is defined through 
the DNA-target sites positioned along the DNA molecule. The ordering of the 
DNA-binding domains consequently defines the order of the enzymes of the
metabolic pathway, which are genetically fused to the DNA-binding domains. 
The overall speed and effectiveness of reaction catalysis can be improved by 
the presence of a DNA scaffold 

A protein scaffold has similar characteristics to the DNA scaffold but is protein 
based. In contrast to the DNA scaffold, where no natural examples are known, 
protein scaffolds also occur in nature 

The DNA-target site or DNA-binding element is a nucleotide sequence that is 
recognized by the DNA-binding domain 

DNA program stands for the defined order of DNA-target sites on the DNA 
scaffold. Spacers in the DNA sequence separate the DNA-target sites on the 
DNA program, which determines the spatial orientation of the enzymes 
bound to the DNA relative to each other 

Substrate channeling is the transfer of a product of one enzyme directly to the 
next enzyme with minimal release into the bulk solution. The result of 
substrate channeling is an improved overall reaction efficiency compared to 
the situations where the enzymes are randomly distributed within the 
cytoplasm 

The synthetic DNA-binding protein is a designed protein that binds a prede- 
fined DNA sequence. Individual modules of zinc fingers or TAL proteins are 
used for the construction of synthetic DNA-binding domains. Each module of 
the zinc finger has a defined specificity for the nucleotide triplet on the DNA 
molecule. Similarly, each module of the TAL protein can bind a single prede- 
fined nucleotide 

Spatial organization is a defined arrangement of components in space. Within 
the context of metabolic engineering, this means that biosynthetic enzymes 
are fixed in a defined arrangement imposed by the scaffold 

The fusion protein or chimeric protein is a protein created through the 
joining of two or more genes that code for individual proteins or protein 
domains. In our case, this refers to the fusion of an enzyme and a DNA- 
binding domain 


13.1 Introduction


The ability to engineer cells with subcellular spatial precision is a very powerful 
and essential tool in synthetic biology. Specifically, co-localization of proteins, 
DNA, and RNA enhances metabolic output of enzymes [1, 2], allows novel regu- 
lation of gene expression [3-5], and can increase the specificity of therapeutics 
[6, 7]. This occurs primarily because co-localized macromolecules have high 
local concentrations, allowing their activities to be coordinated. Thus, better 
ability to organize proteins, RNAs, lipids, etc. into synthetic macromolecular 
complexes should enable diverse and more complex function than can be 
achieved by solely engineering individual parts. 

In this chapter, we illustrate how synthetic RNA constructs are advancing 
efforts toward in vivo spatial engineering. Natural noncoding RNAs already play 
structural and catalytic roles in cells. A breadth of studies has established design 
principles that can be used to predictably shape RNA secondary structures 
[8-11]. Structural malleability of RNA, the ease of expressing synthetic RNA
constructs in cells, their stability, and advances in methods for assaying and 
imaging assembled structures are some of the many reasons why RNA is a useful 
scaffolding material. Synthetic biology efforts have demonstrated that carefully 
designed RNA can be used for subcellular targeting of probes, enzymes, and 
therapeutic agents. 


13.2 Structural Roles of Natural RNA 


RNAs perform numerous biological functions as canonical gene expression 
agents, catalysts, gene regulation switches, and structural scaffolds. These
structural and catalytic roles of RNA are due in large part to the tremendous diversity
of secondary and tertiary structures assumed by natural RNA and the fact that 
ribose sugars are more reactive than deoxyribose. RNA secondary structures can 
include intricate motifs like double helices, hairpin loops, bulges, pseudoknots, 
and right-angled turns [12, 13]. Aside from Watson-Crick base pairing, RNA
has the capacity to form Hoogsteen base pairs as well as wobble base pairs. Such
interactions allow motifs to be connected in higher-order tertiary interactions,
predominantly through non-Watson-Crick base pairs [14, 15].


13.2.1 RNA as a Natural Catalyst


Catalytic roles of RNA during translation, like the tRNA shown in Figure 13.1a, 
disrupted a simple view held by the central dogma that RNA exists merely to 
transfer genetic information from DNA to protein. Today we know that RNA 
has catalytic and regulatory roles in many other cellular processes as well. 
Regulatory RNA structures play a significant role in the control of translation 
initiation of several bacterial genes and in bacterial immunity [17]. RNAs affect 
expression in cis, by forming secondary structures near translation start sites of 
the mRNA. The cis regions can bind to regulatory proteins or other RNAs that 
affect translation in trans [17]. Other similarly dynamic regulatory RNA regions 
can consist of aptamers, which are nucleic acids that selectively bind ligands 


[Figure 13.1 graphic: natural parts; panels (a) tRNA, (b) riboswitches, (c) aptamers, (d) lncRNAs.]


Figure 13.1 Prevalence and diversity of secondary structure in natural RNA. (a) The
alanine-carrying transfer RNA shown here has the typical cloverleaf structure common
among tRNAs. (b) The theophylline-binding riboswitch (from PDB: 1O15_A) is a canonical
riboswitch. (c) The PP7 aptamer [16] binds to the PP7 coat protein with low nanomolar
affinity. (d) The Homo sapiens TERC lncRNA (NR_001566.1) is an example of a natural
lncRNA that serves as a scaffold.


[18]. Many metabolic genes are “switched” on or off, triggered by the binding of 
small molecule metabolites to some of these regulatory RNAs known as ribos- 
witches (Figure 13.1b) [19]. 


13.2.2 RNA Scaffolds in Nature


There are also several instances of natural RNAs that are largely structural. Some 
natural RNAs are known to specifically bind the coat proteins of single-strand 
RNA phages. Such interactions help package the RNA into viral capsids. Some 
RNA phages that have well-characterized RNA-binding proteins include PP7 
(Figure 13.1c) [16], MS2 [20], and Qβ [21]. These coat proteins also act as
repressors of the viral replicase translation by specifically binding RNA hairpins near
the origin of replication. In the bacteriophage Φ29, a short (117-174 nt) sequence
of packaging RNA (pRNA) helps to pack phage DNA into preformed capsids
[22]. A DNA packaging motor is composed of a pentameric ring of pRNA, capsid
proteins, dsDNA, and an ATPase [23]. Studies characterizing the specificity and
stoichiometries of these interactions [16, 24-26] have laid the foundation for
RNA-tagging-based applications that we look at in Section 13.4. 

RNA scaffolds are important in eukaryotic gene expression as well. Mammalian 
cells appear to extensively employ long noncoding RNAs (lncRNAs). These
lncRNAs (Figure 13.1d) are rich with secondary structure motifs [27, 28], some
of which bind and coordinate proteins on scaffolds that play important roles in 
epigenetic regulation [29, 30] and telomere maintenance [31, 32]. 

Thus, natural RNA diversity offers a template of diverse structure and function 
for synthetic biologists. In the following section, we look at how natural observa- 
tions have been translated into an understanding of the means to precisely engi- 
neer structure and dynamics of RNA. 


13.3 Design Principles for RNA Are Well Understood 


In order to design, build, and test structures at the molecular scale, one must 
understand the physical properties of the building material. In particular, if one 
uses a biopolymer such as a protein or nucleic acid to build a higher-order struc- 
ture, the folding properties of that polymer will dictate the structure. This is 
especially a challenge in the case of protein engineering, where protein structure 
is extremely difficult to predict ab initio [33, 34]. As a result, many protein engi- 
neers have focused on substituting functional rather than structural residues in 
existing proteins [35]. Unlike proteins, nucleic acids have a well-defined helical 
structure governed by a simple set of complementarity rules [36] with some 
exceptions such as wobble pairing and G quadruplexes [37, 38]. As a result, the 
structural and folding properties of RNA are generally well understood. In addi- 
tion, RNA is a dynamic molecule [39-42] that can self-assemble into structures 
in vitro [13, 43-46] and can be easily transcribed from a DNA template in vivo. 
RNA functionality can also be improved using in vitro selection [47, 48]. For 
these reasons, RNA makes a suitable material for constructing synthetic in vivo 
nanostructures. 


13.3.1 RNA Secondary Structure is Predictable


Most RNAs fold into a secondary structure consisting of a series of base-paired 
stems and unpaired loops. This secondary structure is largely determined by 
complementary bases within the primary RNA sequence. As a result, RNA sec- 
ondary structure can be predicted computationally using a variety of methods. 
This typically involves using a model of the free energy of RNA base pairing 
[49, 50] to determine the minimum free energy secondary structure [8-11]. 
Structures with near-optimal folds are also calculated by these software packages, 
since they may be of interest, and partition functions are used to determine the 
relative probabilities of particular secondary structures based on their energetics 
(Figure 13.2a) [8-11]. Additional factors, such as wobble base pairing, pseudo- 
knots, and dangling bases, are often incorporated into these calculations [8, 55]. 
Several software packages have been developed for the purpose of calculating 
DNA or RNA secondary structure. These include UNAFold, RNAstructure, 
NUPACK, and ViennaRNA [8-10, 55]. The software is typically implemented as 
a web server that can be used to run calculations using an online interface; it is 
also possible to install a local copy of the software. Each package has a somewhat 
different feature set (see Table 13.1 for details). For example, RNAstructure can 


[Figure 13.2 graphic: (a) secondary structure prediction (input: primary sequence; software such as Mfold, NUPACK, UNAfold; output: candidate secondary structures); (b) RNA self-assembly; (c) synthetic parts (riboregulators, synthetic ribozymes, ligand-regulated riboregulators, ligand-regulated ribozymes).]


Figure 13.2 Design principles for RNA structure and function. (a) RNA secondary structure 
can be predicted from the primary sequence using a variety of software packages. (b) RNA can 
self-assemble into 2D or 3D structures in vitro. (c) Researchers have developed a variety of 
synthetic parts, such as synthetic riboregulators, synthetic ribozymes, ligand-regulated 
riboregulators, and ligand-regulated ribozymes [51-54]. (d) In vitro selection can be used to 
enhance the function of RNAs through iterative rounds of amplification and selection. 


Table 13.1 Comparison of features between RNA structure prediction software packages.


Feature NUPACK RNAstructure UNAfold ViennaRNA 
MFE calculation e e e e 
Partition function ° ° e ° 
Wobble pairing ° e e e 
Pseudoknots ° e ° ° 
Dangling bases ° ° e e 
Multi-strand interactions ° ° ° ° 
Uses SHAPE/NMR data ° ) ° ° 
Graphical User Interface ° ) ° ° 
Web Interface ° e ry e 


A filled-in circle indicates that the software package contains the feature in a row, whereas an empty 
circle indicates that the software package does not contain the feature in a row. 
MFE, minimum free energy. 


integrate user-supplied experimental data such as selective 2′-hydroxyl acylation analyzed
by primer extension (SHAPE) [56] or NMR to aid in structure calculation and has a
convenient graphical user interface [10]. ViennaRNA is designed to be computa- 
tionally efficient for testing many RNA structures in batches rather than for 
analyzing individual species in more detail [8]. UNAFold is derived from mfold, 
which used the first dynamic programming algorithm for predicting RNA 
secondary structure [9, 57]. A particularly useful package for designing RNA 
structures is NUPACK, which can handle multi-strand interactions and allows 
the user to design sequences that have a propensity to assemble into a user- 
defined set of secondary structures [55, 58]. Given the diversity of software pack- 
ages for predicting RNA secondary structure, it is important to choose the right 
software package for one’s particular design needs. 
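
The dynamic programming idea behind these predictors can be illustrated with a deliberately simplified sketch. The Python code below implements the classic Nussinov base-pair maximization, which is not the nearest-neighbor free energy model used by the packages named above: it scores every allowed pair equally, forbids pseudoknots, and enforces only a minimum hairpin loop length. The function name, the inclusion of G-U wobble pairs, and the default loop length of 3 are assumptions made for this illustration.

```python
def nussinov(seq: str, min_loop: int = 3):
    """Return (number_of_pairs, dot_bracket) maximizing nested base pairs.

    Simplified stand-in for minimum free energy folding: every allowed pair
    counts as 1, and a hairpin loop must contain at least min_loop bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # dp[i][j] = maximum number of pairs within seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: recover one optimal structure in dot-bracket notation.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in pairs:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (dp[0][n - 1] if n else 0), "".join(structure)


if __name__ == "__main__":
    print(nussinov("GGGAAACCC"))  # a short hairpin: three G-C pairs around an AAA loop
```

Real packages replace the unit score with experimentally derived stacking and loop energies and add a partition function on top of the same recursion structure, which is why their predictions are considerably more reliable than this sketch.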


13.3.2 RNA can Self-Assemble into Structures


RNA can self-assemble into geometrically precise structures in vitro (Figure 13.2b). 
This was first shown for small RNA molecules with four stem-loops (tectoRNAs), 
which self-assemble into 1D structures using kissing loops [59], but has since been 
extended to form a variety of geometrically precise 2D and 3D shapes [13, 43-46, 
60]. Of particular note are the in vivo RNA assemblies [1], which can self-assemble 
into 1D or 2D lattices. Although in vitro structures have traditionally been formed 
using a thermal annealing process, recent work has shown that single-stranded 
DNA tiles and bricks [61, 62] can self-assemble into discrete nanostructures 
isothermally and under biocompatible conditions [63]. Thus, it is possible to self- 
assemble a diverse range of scaffolds using RNA. 


13.3.3 Dynamic RNAs can be Rationally Designed 


Beyond structure formation, RNA also has the capability to dynamically reconfigure 
itself in response to small molecules or other ligands [39-42]. Natural examples
such as ribozymes and riboswitches underscore the notion that
RNAs can be dynamic molecules. However, RNAs can also be rationally designed to 
go beyond their natural function (Figure 13.2c). For example, synthetic riboreg- 
ulators can be designed to control genes in the presence of a user-defined input 
RNA molecule [51]. It is even possible to combine pairs of functional RNAs to 
form more complicated devices, such as by combining riboswitches with ribozymes 
[64], riboswitches with riboregulators [52, 65], or aptamers with transcriptional 
attenuators [66]. These compound RNA devices underscore the notion that RNA 
secondary structure can be programmed to achieve a range of dynamic functions. 


13.3.4 RNA can be Selected in vitro to Enhance Its Function


Another powerful technique that has aided the development of many functional 
RNA motifs is in vitro selection or systematic evolution of ligands by exponen- 
tial enrichment (SELEX) [47, 48] (Figure 13.2d). This typically involves starting 
with a library of many (10'°-10"°) distinct RNA sequences and then applying 
iterative rounds of selection (e.g., binding to a small molecule immobilized on a 
surface or catalyzing ligation to a surface-bound ligand) and amplification (typi- 
cally involving polymerase chain reaction (PCR)). After ~10 rounds of selection 
and amplification, the activity of the remaining RNA sequences in the pool can 
be enhanced by several orders of magnitude compared with the initial library 
average [67]. Some functions may not be present in a library of 10'° RNAs; thus 
it may sometimes be necessary to chemically modify or structurally bias the 
initial library [67]. This limitation aside, in vitro selection is a useful technique 
for generating synthetic RNAs with specific functions. 
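
Why a modest number of rounds can suffice is easiest to see in a toy model. The Python sketch below assumes a pool with only two classes of sequences, active and inactive, and a constant per-round enrichment factor by which the selection step favors active sequences; amplification is assumed to preserve the proportions. The starting fraction and the tenfold enrichment used in the example are hypothetical numbers chosen for illustration, not values from the cited work.

```python
def enrich(fraction_active: float, enrichment: float, rounds: int) -> list:
    """Fraction of active sequences in the pool before and after each round.

    fraction_active: starting fraction of active sequences, strictly between 0 and 1.
    enrichment: factor by which selection favors active over inactive sequences.
    rounds: number of selection and amplification cycles.
    """
    if not 0.0 < fraction_active < 1.0:
        raise ValueError("fraction_active must lie strictly between 0 and 1")
    if enrichment <= 0.0:
        raise ValueError("enrichment must be positive")
    history = [fraction_active]
    fraction = fraction_active
    for _ in range(rounds):
        retained_active = fraction * enrichment  # actives are favored by the selection step
        retained_inactive = 1.0 - fraction       # inactives carried over as background
        fraction = retained_active / (retained_active + retained_inactive)
        history.append(fraction)
    return history


if __name__ == "__main__":
    # Hypothetical example: one active sequence per billion, tenfold enrichment per round.
    for round_number, fraction in enumerate(enrich(1e-9, 10.0, 10)):
        print(round_number, f"{fraction:.3e}")
```

In this model the odds of drawing an active sequence are multiplied by the enrichment factor in every round, so rare sequences can come to dominate the pool after about ten rounds; real selections deviate from this because stringency, amplification bias, and the distribution of activities all vary.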

In the two decades since the development of in vitro selection, thousands of 
aptamers (oligonucleotides that bind to a particular ligand) have been developed 
[68]. These include aptamers to small molecules, peptides, and even human and 
cancer cell types [47, 67, 69-71]. In addition to RNA molecules, proteins such as 
epitopes and antibodies have been evolved using in vitro selection [72-74]. Thus, in 
vitro selection can be used to enhance functional portions of an RNA scaffold. This 
is especially useful when existing RNA parts are not sufficient for the task at hand. 


13.4 Applications of Designed RNA Scaffolds 


RNA sequences consisting of secondary structures and functional units 
designed using the tools described previously can be genetically expressed in 
cells. Such engineered RNAs have been used for tasks ranging from studying 
natural RNA processing in cells to metabolic engineering and therapeutic 
applications. 


13.4.1 Tools for RNA Research


While mRNA has long been known to function as a template for protein transla- 
tion, the spatiotemporal aspects of the various steps involved in mRNA processing 
remain poorly understood. Investigation of the dynamics of mRNA as it goes
through translation, splicing, nuclear export in eukaryotes, localization for trans- 
lation, and finally degradation requires tools to track individual RNA molecules. 
Aptamers and their recruitment of fluorescent proteins on engineered mRNA 
scaffolds have enabled such studies. 

Some of the earliest attempts to tag RNA in vivo were carried out by expressing 
GFP fused with bacteriophage MS2 coat protein [75] or human RNA-interacting
protein domain U1A [76] along with mRNA containing the corresponding bind- 
ing sites in Saccharomyces cerevisiae. Such tags enabled tracking of single-cell 
mRNA localization by microscopy. Furthermore, by incorporating tandem 
repeats of MS2 binding sites on reporter mRNA [77], several GFP-MS2 fusions
could be localized on a single transcript, enabling tracking of individual mRNA 
molecules in mammalian cells (Figure 13.3a). This in vivo tracking method was 
extended to other systems [82], including bacteria [78, 83]. 

More recently, several efforts have addressed the long-standing question of 
whether or not RNA is highly localized within bacterial cells [84, 85]. A signifi- 
cant innovation over the previous strategy came from the use of fluorescent 
protein complementation. In this approach, RNA aptamers are used to bring 
together two different protein fusion units, each with a split fluorescent protein 
fused to an RNA-binding domain (RBD) [79, 86] (Figure 13.3a). Since only the 
scaffolded protein units are able to reconstitute the split chromophore and fluo- 
resce, they can be easily distinguished from the unbound ones. Such an approach 
hence achieves lower background signals than systems where autofluorescent 
proteins are directly tagged onto RNA. 
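
The contrast argument above can be illustrated with a toy calculation. The sketch below is not taken from the cited studies: the function name `spot_contrast` and every number in it (24 aptamer sites, 80% occupancy, 10 unbound fusions sharing the detection volume, 5% residual brightness for unassembled split halves) are hypothetical values chosen only to show the trend.

```python
def spot_contrast(n_sites, occupancy, free_in_spot, free_brightness):
    """Ratio of scaffold-bound signal to background from unbound fusions.

    n_sites:         aptamer repeats (or aptamer pairs) on one transcript
    occupancy:       fraction of sites occupied by a fusion protein (0 to 1)
    free_in_spot:    unbound fusion proteins sharing the detection volume
    free_brightness: brightness of an unbound fusion relative to a bound one
    """
    signal = n_sites * occupancy
    background = free_in_spot * free_brightness
    return signal / background


# Direct tagging: unbound GFP fusions are as bright as bound ones (assumed).
direct = spot_contrast(n_sites=24, occupancy=0.8, free_in_spot=10, free_brightness=1.0)

# Complementation: unbound split halves are assumed to be nearly dark.
split = spot_contrast(n_sites=24, occupancy=0.8, free_in_spot=10, free_brightness=0.05)

print(f"direct tagging contrast:  {direct:.2f}")
print(f"complementation contrast: {split:.2f}")
```

In this simplified model the gain in contrast equals the inverse of the residual brightness of the unbound halves. In a real experiment the gain would also depend on expression levels, binding affinities, and the maturation of the reconstituted fluorophore, none of which are modeled here.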

As the repertoire of aptamer/RNA-binding protein pairs is being extended 
through the in vitro methods described in Section 13.3.4, newer combinations 
are being used to explore cellular function [87]. The studies discussed here have 
led to a better understanding of RNA diffusion and localization [78, 79] in bacte- 
rial cells and measurement of transcriptional kinetics [88]. These efforts also 
enabled localization of a diverse array of proteins (such as enzymes) on RNA 
scaffolds, opening up applications in metabolic engineering. 


13.4.2 Localizing Metabolic Enzymes on RNA 


Scaffolding and compartmentalization are effective strategies for optimization of 
metabolic pathway performance in both natural and synthetic systems [89, 90]. 
A few studies have used DNA structures to coordinate the assembly of enzymes 
and study effects of spatial co-localization in vitro [91-94] and in vivo [95]. 
Protein scaffolds have also been used to channel metabolic substrates between 
co-localized enzymes in living cells [2, 96]. Scaffolding is seen as a powerful tool 
to specifically direct metabolic pathway flux toward enzymes of choice, prevent 
loss of intermediates to competing reactions, and protect the host cell from any 
toxic or volatile intermediates through confinement at a subcellular location. 

Figure 13.3 Applications of RNA scaffolds in vivo. (a) mRNAs are modified to include either 
several repeats of an aptamer or two different aptamers in close proximity. The former 
approach results in concentrated foci of fluorescent protein fusions to RNA-binding domains 
(RBDs) [78]; in the latter, two halves of a split fluorescent protein, each fused to an RBD [79], 
complement and fluoresce only on the mRNA scaffold. (b) Enzymes fused to RBDs localize to 
self-assembled RNA scaffolds presenting aptamers. Channeling of intermediate metabolites can 
lead to enhanced pathway flux toward biofuels or other high value products [1]. (c) Pentamer of 
bacteriophage φ29 pRNA [23] from PDB file 1FOQ. Tagging the monomers with functional 
units like siRNA can make them useful drug delivery vehicles [6, 80]. (d) The clover leaf tRNA 
sequence can be tagged with recombinant RNA and epitopes as shown to allow for its 
synthesis and purification [81]. 

A notable effort in the use of RNA scaffolds for metabolic channeling achieved 
a nearly 50-fold increase in hydrogen gas production in Escherichia coli [1]. This 
effort combined many of the techniques discussed previously. Synthetic RNA 
strands comprising polymerization domains and aptamers for MS2 and PP7 coat 
proteins were expressed in the bacteria. Dimerization and polymerization 
domains allowed for tiling and assembly into a macromolecular structure. The 
large (40-100 nm) intracellular RNA assemblies greatly enhanced the flux of 
electrons from ferredoxin to hydrogenase when both enzymes were tethered to 
the scaffold with fusions to MS2 and PP7 (Figure 13.3b). Furthermore, significant 
differences in titer were observed for scaffolding structures having different 
geometries, tying metabolic flux to the specific spatial positioning of the scaffold. 
Such an approach brings modular design and scalability [97] to metabolic engi- 
neering for biofuels and high value chemical synthesis, where control of interme- 
diate metabolite flux can be critical [98-100]. 

There has been debate about the mechanism by which scaffolds enable meta- 
bolic substrate channeling. The transfer of electrons between enzymes relies on 
physical contact and thus is limited by protein diffusion rates and competition, 
which are effectively addressed by scaffolding [1]. However, the role of enzyme 
co-localization in pathways involving diffusible intermediates is much less well 
understood [101, 102]. In a recent study [103], enzymes localized in close proximity, 
less than 30 nm apart, on in vitro assembled DNA scaffolds exhibited 
enhanced rates of metabolite exchange. The transfer rates dropped precipitously 
with any further increase in interenzyme distance. Since such effects are not 
explicable by 3D diffusion models [101], a mechanism of metabolite substrate 
channeling by restricted diffusion on hydration layers across crowded protein 
surfaces has been proposed [103]. RNA scaffolds, with their predictable geome- 
try, can be used to create a range of metabolic channeling platforms and test the 
relative effects from these two different mechanisms. 
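
To see why a sharp distance dependence is hard to reconcile with free diffusion, consider the textbook steady-state solution for a point source in unbounded three-dimensional solution, c(r) = k / (4πDr). The sketch below evaluates it at several interenzyme spacings. It is an idealization rather than the model used in the cited studies, and the turnover rate, diffusion coefficient, and 10 nm reference spacing are hypothetical values chosen for illustration.

```python
import math


def point_source_concentration(rate, diffusion_coeff, distance):
    """Steady-state concentration at `distance` from a point source that
    releases `rate` molecules per second into unbounded 3D solution:
    c(r) = rate / (4 * pi * D * r)."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return rate / (4.0 * math.pi * diffusion_coeff * distance)


# Hypothetical values, for illustration only.
TURNOVER = 100.0      # intermediate molecules released per second
DIFFUSION = 1.0e8     # nm^2/s (100 um^2/s), assumed for a small metabolite
REFERENCE_NM = 10.0   # spacing used as the baseline


def relative_concentration(distance_nm):
    """Concentration at distance_nm divided by that at REFERENCE_NM."""
    ref = point_source_concentration(TURNOVER, DIFFUSION, REFERENCE_NM)
    return point_source_concentration(TURNOVER, DIFFUSION, distance_nm) / ref


for d in (10, 20, 30, 60):
    print(f"{d:>3} nm: {relative_concentration(d):.2f} of the 10 nm value")
```

Under this idealization the intermediate concentration seen by the second enzyme falls only as 1/r, so doubling the spacing halves it regardless of the assumed rate or diffusion coefficient. A gradual decline of this kind does not by itself produce a precipitous drop in transfer rate beyond a threshold distance.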


13.4.3 Packaging Therapeutics on RNA Scaffolds 


While metabolic channeling functions relied on RNA interactions with proteins, 
RNA-RNA interactions can also be used for exciting scaffold applications. pRNA 
from bacteriophage φ29 (referred to in Section 13.2) has been used as a building 
block for bottom-up assembly of drug delivery vehicles [6, 80] (Figure 13.3c). 
pRNA monomers consist of structural hairpin regions and dimerization/polym- 
erization domains. Ends of the hairpin regions offer sites for tagging with drugs 
or targeting molecules. The polymerization domains can be engineered to favor 
formation of dimers, trimers, pentamers, or hexamers as stable drug carriers 
[6, 23, 80]. Heterodimers containing pRNA tagged with a CD4 aptamer and 
pRNA attached to an siRNA were shown to specifically target CD4-expressing 
T cells, leading to cell death [80]. This in vitro study also showed stability and 
efficacy of the nanoscale drug delivery particles for killing cancer cells. Such sys- 
tems are advantageous since the pRNA polymers are hypothesized to be stable in 
physiological conditions and be less immunogenic than protein carriers [80]. 
Finally, these polymers could be made specific to many in situ targets by using 
engineered RNA aptamers that specifically recognize cellular moieties. 


13.4.4 Recombinant RNA Technology 


RNA scaffolds have also been used to serve as protective tethers for the purifica- 
tion of recombinant RNA (recRNA) (Figure 13.3d) [81]. In this approach, a tRNA 
scaffold acts as a protective secondary structure to insulate the transcript from 
native E. coli nucleases and therefore stabilize production of recRNA in vivo. The 
characteristic clover leaf tRNA structure formed around a recRNA is recognized 
by native cellular enzymes and processed as tRNA. This ensures that each single 
transcript is a product of specific defined length. A Sephadex affinity tag was 
included in the expressed sequence to allow purification of transcripts that con- 
tained RNAs of medical research interest, like the human hepatitis B virus (HBV) 
epsilon [81]. This design thus enables collection of large amounts of purified RNA 
transcripts for in vitro structural studies and vaccine development. Recently, these 
efforts have been extended to expression and purification of RNA-protein complexes 
[104], providing pure samples that could be used for crystallographic studies 
of natural RNA-protein interactions and, potentially, in cell-free systems. 


13.5 Conclusion 


RNA is a powerful tool for synthetic biologists. RNA scaffolds can be composed 
of many structural, dynamic, and functional regions. Secondary structure can be 
predicted reliably, and there is a growing number of assays for proper structure 
assembly. In addition, recent advances in DNA construction [105, 106] have 
made it faster and easier to test new structure designs in vivo. Prediction and 
design of RNA structure in three dimensions remains a challenge. The difficulty 
of going from a secondary structure design to precise orientation of tertiary scaf- 
fold units needs to be addressed for metabolic engineering and therapeutic 
applications. Additionally, although localization of fluorophores to RNA enables 
in vivo imaging, resolution limits have prevented elucidation of precise geomet- 
ric details in RNA scaffolds and assemblies within cells. Future technical advances 
could enable many scientists to construct new RNA scaffolds for a wide range of 
purposes. In the following text, we discuss a particular set of exciting applica- 
tions and the technologies that will enable them. 


13.5.1 New Applications 


Synthetic biologists are constantly seeking to increase the complexity of their 
devices. RNA synthetic biology is offering tools to enable such control [107]. One 
particular goal is the construction of orthogonal ribosomes [108] capable of 
incorporating nonnatural amino acids, wherein altered tRNA-protein interactions 
enable an expanded genetic code [109]. RNA scaffolds are also being employed to 
devise more precise genome editing tools [110]. For metabolic engineering applications, 
RNA scaffolds are enabling control over the relative geometric orientations 
of enzymes in a co-localized pathway, which can lead to better channeling of 
volatile intermediate metabolites [111]. Therapeutic applications of in vivo RNA 
scaffolds include functionalizing natural RNA scaffolds to enable drug delivery or 
isolation of pure samples. Similar developments in the fields of DNA packaging 
and origami for drug delivery [112, 113] could offer strong synergistic opportunities 
for clinically applicable technologies to be implemented. More generally, the 
ability to simulate and predict the dynamics of structure-receptor binding interactions 
should enhance the design of such therapeutics [114]. 


13.5.2 Technological Advances 


Moving forward, innovations in high-throughput design, synthesis, and functional 
assays of RNA structures will enable a greater range of applications to be 
developed. In silico design software packages are continuously improving their 
capabilities, making it possible to computationally generate increasingly compli- 
cated structures [55]. In addition to the advances for in vivo synthesis and purifi- 
cation of RNAs mentioned previously, developments in chip-based synthesis 
could enable hundreds of RNA designs to be synthesized in vitro at a time 
[106, 115]. This, coupled with new structure assembly assays such as SHAPE-Seq 
[116] and improved genetically encodable electron microscopy tags [117, 118], 
will greatly simplify the testing of more complicated structures. Developments in 
RNA imaging [119] can be further advanced by incorporation of docking sites 
that allow RNA to be probed with oligonucleotides using methods like DNA- 
PAINT [120], leading to super-resolution imaging in situ. 

Thus, the discovery of a variety of natural RNA structures and functions, an 
ever-increasing understanding of how such features can be designed, and an 
ability to rapidly implement and test ideas are indicators of a significant role for 
RNA scaffolds in future synthetic biology applications. 


Definitions 


Synthetic biology is a discipline that seeks to control biology using the principles 
of engineering. 

Nanotechnology is the manipulation of matter at the atomic, molecular, and 
supramolecular scale. 

RNA scaffolds are macromolecular structures or assemblies of RNA with well-defined 
secondary structure motifs for spatially organizing other biomolecules. 
These are typically expressed in living cells for metabolic engineering 
purposes. 

Isothermal assembly is the self-assembly of structures at a constant temperature. 

Metabolic engineering is the production of small molecules or short peptides 
through the engineering of metabolic pathways. 

Aptamers are nucleic acid oligonucleotides that bind a specific small molecule 
or other ligand. 


References 


1 Delebecque, C.J., Lindner, A.B., Silver, P.A., and Aldaye, F.A. (2011) Organization 
of intracellular reactions with rationally designed RNA assemblies. Science, 
333 (6041), 470-474. 

2 Dueber, J.E., Wu, G.C., Malmirchegini, G.R., Moon, T.S., Petzold, C.J., Ullal, 
A.V., Prather, K.L.J., and Keasling, J.D. (2009) Synthetic protein scaffolds provide 
modular control over metabolic flux. Nat. Biotechnol., 27 (8), 753-759. 

3 Isaacs, F.J., Dwyer, D.J., and Collins, J.J. (2006) RNA synthetic biology. Nat. 
Biotechnol., 24 (5), 545-554. 

4 Culler, S.J., Hoff, K.G., and Smolke, C.D. (2010) Reprogramming cellular 
behavior with RNA controllers responsive to endogenous proteins. Science, 330 
(6008), 1251-1255. 

5 Qi, L.S., Larson, M.H., Gilbert, L.A., Doudna, J.A., Weissman, J.S., Arkin, A.P., 
and Lim, W.A. (2013) Repurposing CRISPR as an RNA-guided platform for 
sequence-specific control of gene expression. Cell, 152 (5), 1173-1183. 

6 Khaled, A., Guo, S., Li, F., and Guo, P. (2005) Controllable self-assembly of 
nanoparticles for specific delivery of multiple therapeutic molecules to cancer 
cells using RNA nanotechnology. Nano Lett., 5 (9), 1797-1808. 

7 Aldaye, F.A., Senapedis, W.T., Silver, P.A., and Way, J.C. (2010) A structurally 
tunable DNA-based extracellular matrix. J. Am. Chem. Soc., 132 (42), 14727-14729. 


8 Lorenz, R., Bernhart, S.H., Siederdissen, C.H.Z., Tafer, H., Flamm, C., Stadler, P.F., 
and Hofacker, I.L. (2011) ViennaRNA Package 2.0. Algorithms Mol. Biol., 6 (1), 26. 

9 Markham, N.R. and Zuker, M. (2008) UNAFold: software for nucleic acid folding 
and hybridization. Methods Mol. Biol., 453, 3-31. 

10 Reuter, J.S. and Mathews, D.H. (2010) RNAstructure: software for RNA 
secondary structure prediction and analysis. BMC Bioinf., 11 (1), 129. 

11 Zadeh, J.N., Steenberg, C.D., Bois, J.S., Wolfe, B.R., Pierce, M.B., Khan, A.R., 
Dirks, R.M., and Pierce, N.A. (2010) NUPACK: analysis and design of nucleic 
acid systems. J. Comput. Chem., 32 (1), 170-173. 

12 Leontis, N.B., Lescoute, A., and Westhof, E. (2006) The building blocks and 
motifs of RNA architecture. Curr. Opin. Struct. Biol., 16 (3), 279-287. 

13 Jaeger, L. and Chworos, A. (2006) The architectonics of programmable RNA and 
DNA nanostructures. Curr. Opin. Struct. Biol., 16 (4), 531-543. 

14 Cruz, J.A. and Westhof, E. (2009) The dynamic landscapes of RNA architecture. 
Cell, 136 (4), 604-609. 

15 Tinoco, I. Jr. and Bustamante, C. (1999) How RNA folds. J. Mol. Biol., 293 (2), 
271-281. 

16 Lim, F.F., Downey, T.P-T., and Peabody, D.S.D. (2001) Translational repression 
and specific RNA binding by the coat protein of the Pseudomonas phage PP7. 
J. Biol. Chem., 276 (25), 22507-22513. 

17 Waters, L.S. and Storz, G. (2009) Regulatory RNAs in bacteria. Cell, 136 (4), 
615-628. 

18 Winkler, W.C. and Breaker, R.R. (2005) Regulation of bacterial gene expression 
by riboswitches. Annu. Rev. Microbiol., 59, 487-517. 

19 Nudler, E. and Mironov, A.S. (2004) The riboswitch control of bacterial 
metabolism. Trends Biochem. Sci., 29 (1), 11-17. 

20 Hirao, I., Spingola, M., Peabody, D., and Ellington, A.D. (1998) The limits of 
specificity: an experimental analysis with RNA aptamers to MS2 coat protein 
variants. Mol. Diversity, 4 (2), 75-89. 

21 Witherell, G.W. and Uhlenbeck, O.C. (1989) Specific RNA binding by Qβ 
coat protein. Biochemistry, 28 (1), 71-76. 

22 Guo, P., Erickson, S., and Anderson, D. (1987) A small viral RNA is required for in 
vitro packaging of bacteriophage phi 29 DNA. Science, 236 (4802), 690-694. 

23 Simpson, A.A., Tao, Y., Leiman, P.G., Badasso, M.O., He, Y., Jardine, P.J., Olson, 
N.H., Morais, M.C., Grimes, S., Anderson, D.L., Baker, T.S., and Rossmann, 
M.G. (2000) Structure of the bacteriophage phi29 DNA packaging motor. 
Nature, 408 (6813), 745-750. 

24 Ni, C.-Z., Syed, R., Kodandapani, R., Wickersham, J., Peabody, D.S., and Ely, K.R. 
(1995) Crystal structure of the MS2 coat protein dimer: implications for RNA 
binding and virus assembly. Structure, 3 (3), 255-263. 

25 Peabody, D.S. and Ely, K.R. (1992) Control of translational repression by 
protein-protein interactions. Nucleic Acids Res., 20 (7), 1649-1655. 

26 Guo, P., Zhang, C., Chen, C., Garver, K., and Trottier, M. (1998) Inter-RNA 
interaction of phage φ29 pRNA to form a hexameric complex for viral DNA 
transportation. Mol. Cell, 2 (1), 149-155. 

27 Underwood, J.G., Uzilov, A.V., Katzman, S., Onodera, C.S., Mainzer, J.E., 
Mathews, D.H., Lowe, T.M., Salama, S.R., and Haussler, D. (2010) FragSeq: 
transcriptome-wide RNA structure probing using high-throughput sequencing. 
Nat. Methods, 7 (12), 995-1001. 

28 Kertesz, M., Wan, Y., Mazor, E., Rinn, J.L., Nutter, R.C., Chang, H.Y., and Segal, 
E. (2010) Genome-wide measurement of RNA secondary structure in yeast. 
Nature, 467 (7311), 103-107. 

29 Mercer, T.R. and Mattick, J.S. (2013) Structure and function of long 
noncoding RNAs in epigenetic regulation. Nat. Struct. Mol. Biol., 20 (3), 300-307. 

30 Tsai, M.C., Manor, O., Wan, Y., Mosammaparast, N., Wang, J.K., Lan, F., Shi, Y., 
Segal, E., and Chang, H.Y. (2010) Long noncoding RNA as modular scaffold of 
histone modification complexes. Science, 329 (5992), 689-693. 

31 Zappulla, D.C. and Cech, T.R. (2004) Yeast telomerase RNA: a flexible 
scaffold for protein subunits. Proc. Natl. Acad. Sci. U.S.A., 101 (27), 
10024-10029. 

32 Theimer, C.A. and Feigon, J. (2006) Structure and function of telomerase RNA. 
Curr. Opin. Struct. Biol., 16 (3), 307-318. 

33 Arnold, F.H. (2001) Combinatorial and computational challenges for biocatalyst 
design. Nature, 409 (6817), 253-257. 

34 Bonneau, R. and Baker, D. (2001) Ab initio protein structure prediction: 
progress and prospects. Annu. Rev. Biophys. Biomol. Struct., 30, 173-189. 

35 Bornscheuer, U.T., Huisman, G.W., Kazlauskas, R.J., Lutz, S., Moore, J.C., and 
Robins, K. (2012) Engineering the third wave of biocatalysis. Nature, 485 (7397), 
185-194. 

36 Watson, J.D. and Crick, F.H. (1953) Molecular structure of nucleic acids. Nature, 
171 (4356), 737-738. 

37 Varani, G. and McClain, W.H. (2000) The G·U wobble base pair. EMBO Rep., 1 
(1), 18-23. 

38 Lipps, H.J. and Rhodes, D. (2009) G-quadruplex structures: in vivo evidence and 
function. Trends Cell Biol., 19 (8), 414-422. 

39 Guerrier-Takada, C., Gardiner, K., Marsh, T., Pace, N., and Altman, S. (1983) 
The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell, 
35 (3), 849-857. 

40 Kruger, K., Grabowski, P.J., Zaug, A.J., Sands, J., Gottschling, D.E., and Cech, 
T.R. (1982) Self-splicing RNA: autoexcision and autocyclization of the ribosomal 
RNA intervening sequence of Tetrahymena. Cell, 31 (1), 147-157. 

41 Nahvi, A., Sudarsan, N., Ebert, M.S., Zou, X., Brown, K.L., and Breaker, R.R. 
(2002) Genetic control by a metabolite binding mRNA. Chem. Biol., 9 (9), 1043. 

42 Winkler, W., Nahvi, A., and Breaker, R.R. (2002) Thiamine derivatives bind 
messenger RNAs directly to regulate bacterial gene expression. Nature, 419 
(6910), 952-956. 

43 Afonin, K.A., Bindewald, E., Yaghoubian, A.J., Voss, N., Jacovetty, E., Shapiro, 
B.A., and Jaeger, L. (2010) In vitro assembly of cubic RNA-based scaffolds 
designed in silico. Nat. Nanotechnol., 5 (9), 676-682. 

44 Chworos, A., Severcan, I., Koyfman, A.Y., Weinkam, P., Oroudjev, E., Hansma, 
H.G., and Jaeger, L. (2004) Building programmable jigsaw puzzles with RNA. 
Science, 306 (5704), 2068-2072. 

45 Dibrov, S.M., McLean, J., Parsons, J., and Hermann, T. (2011) Self-assembling 
RNA square. Proc. Natl. Acad. Sci. U.S.A., 108 (16), 6405-6408. 


46 Severcan, I., Geary, C., Chworos, A., Voss, N., Jacovetty, E., and Jaeger, L. (2010) 
A polyhedron made of tRNAs. Nat. Chem., 2 (9), 772-779. 

47 Ellington, A.D. and Szostak, J.W. (1990) In vitro selection of RNA molecules that 
bind specific ligands. Nature, 346 (6287), 818-822. 

48 Tuerk, C. and Gold, L. (1990) Systematic evolution of ligands by exponential 
enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science, 249 
(4968), 505-510. 

49 Walter, A.E., Turner, D.H., Kim, J., Lyttle, M.H., Müller, P., Mathews, D.H., and 
Zuker, M. (1994) Coaxial stacking of helixes enhances binding of 
oligoribonucleotides and improves predictions of RNA folding. Proc. Natl. Acad. 
Sci. U.S.A., 91 (20), 9218-9222. 

50 Mathews, D.H., Sabina, J., Zuker, M., and Turner, D.H. (1999) Expanded 
sequence dependence of thermodynamic parameters improves prediction of 
RNA secondary structure. J. Mol. Biol., 288 (5), 911-940. 

51 Isaacs, F.J., Dwyer, D.J., Ding, C., Pervouchine, D.D., Cantor, C.R., and Collins, 
J.J. (2004) Engineered riboregulators enable post-transcriptional control of gene 
expression. Nat. Biotechnol., 22 (7), 841-847. 

52 Bayer, T.S. and Smolke, C.D. (2005) Programmable ligand-controlled 
riboregulators of eukaryotic gene expression. Nat. Biotechnol., 23 (3), 
337-343. 

53 Lou, C., Stanton, B., Chen, Y.-J., Munsky, B., and Voigt, C.A. (2012) Ribozyme-based 
insulator parts buffer synthetic circuits from genetic context. 
Nat. Biotechnol., 30 (11), 1137-1142. 

54 Win, M.N. and Smolke, C.D. (2007) Targeted cleavage: tuneable cis-cleaving 
ribozymes. Proc. Natl. Acad. Sci. U.S.A., 104 (38), 14881-14882. 

55 Zadeh, J.N., Wolfe, B.R., and Pierce, N.A. (2010) Nucleic acid sequence design 
via efficient ensemble defect optimization. J. Comput. Chem., 32 (3), 439-452. 

56 Merino, E.J., Wilkinson, K.A., Coughlan, J.L., and Weeks, K.M. (2005) RNA 
structure analysis at single nucleotide resolution by selective 2'-hydroxyl acylation 
and primer extension (SHAPE). J. Am. Chem. Soc., 127 (12), 4223-4231. 

57 Zuker, M. and Stiegler, P. (1981) Optimal computer folding of large RNA sequences 
using thermodynamics and auxiliary information. Nucleic Acids Res., 9 (1), 133. 

58 Dirks, R.M., Bois, J.S., Schaeffer, J.M., Winfree, E., and Pierce, N.A. (2007) 
Thermodynamic analysis of interacting nucleic acid strands. SIAM Rev., 49 (1), 
65-88. 

59 Jaeger, L. and Leontis, N. (2000) Tecto-RNA: one-dimensional self-assembly 
through tertiary interactions. Angew. Chem. Int. Ed., 39 (14), 2521-2524. 

60 Khisamutdinov, E.F., Jasinski, D.L., and Guo, P. (2014) RNA as a boiling-resistant 
anionic polymer material to build robust structures with defined shape and 
stoichiometry. ACS Nano, 8 (5), 4771-4781. 

61 Wei, B., Dai, M., and Yin, P. (2012) Complex shapes self-assembled from 
single-stranded DNA tiles. Nature, 485 (7400), 623-626. 

62 Ke, Y., Ong, L.L., Shih, W.M., and Yin, P. (2012) Three-dimensional structures 
self-assembled from DNA bricks. Science, 338 (6111), 1177-1183. 

63 Myhrvold, C., Dai, M., Silver, P.A., and Yin, P. (2013) Isothermal self-assembly 
of complex DNA structures under diverse and biocompatible conditions. 
Nano Lett., 13 (9), 4242-4248. 


64 Win, M.N. and Smolke, C.D. (2007) A modular and extensible RNA-based 
gene-regulatory platform for engineering cellular function. Proc. Natl. Acad. Sci. 
U.S.A., 104 (36), 14283-14288. 

65 Beisel, C.L., Bayer, T.S., Hoff, K.G., and Smolke, C.D. (2008) Model-guided 
design of ligand-regulated RNAi for programmable control of gene expression. 
Mol. Syst. Biol., 4 (224). 

66 Qi, L., Lucks, J.B., Liu, C.C., Mutalik, V.K., and Arkin, A.P. (2012) Engineering 
naturally occurring trans-acting non-coding RNAs to sense molecular signals. 
Nucleic Acids Res., 40 (12), 5775-5786. 

67 Wilson, D.S. and Szostak, J.W. (1999) In vitro selection of functional nucleic 
acids. Annu. Rev. Biochem., 68 (1), 611-647. 

68 Lee, J.F., Hesselberth, J.R., Meyers, L.A., and Ellington, A.D. (2004) Aptamer 
database. Nucleic Acids Res., 32 (Database issue), D95-D100. 

69 Davis, J.H. and Szostak, J.W. (2002) Isolation of high-affinity GTP aptamers from 
partially structured RNA libraries. Proc. Natl. Acad. Sci. U.S.A., 99 (18), 
11616-11621. 

70 Colas, P., Cohen, B., Jessen, T., Grishina, I., McCoy, J., and Brent, R. (1996) 
Genetic selection of peptide aptamers that recognize and inhibit cyclin-dependent 
kinase 2. Nature, 380 (6574), 548-550. 

71 Shangguan, D., Li, Y., Tang, Z., Cao, Z.C., Chen, H.W., Mallikaratchy, P., 
Sefah, K., Yang, C.J., and Tan, W. (2006) Aptamers evolved from live cells as 
effective molecular probes for cancer study. Proc. Natl. Acad. Sci. U.S.A., 103 
(32), 11838-11843. 

72 Hanes, J. and Plückthun, A. (1997) In vitro selection and evolution of functional 
proteins by using ribosome display. Proc. Natl. Acad. Sci. U.S.A., 94 (10), 4937-4942. 

73 Roberts, R.W. and Szostak, J.W. (1997) RNA-peptide fusions for the in vitro 
selection of peptides and proteins. Proc. Natl. Acad. Sci. U.S.A., 94 (23), 
12297-12302. 

74 Bayer, T.S., Booth, L.N., Knudsen, S.M., and Ellington, A.D. (2005) Arginine-rich 
motifs present multiple interfaces for specific binding by RNA. RNA, 11 (12), 
1848-1857. 

75 Bertrand, E., Chartrand, P., Schaefer, M., Shenoy, S.M., Singer, R.H., and Long, 
R.M. (1998) Localization of ASH1 mRNA particles in living yeast. Mol. Cell, 2 
(4), 437-445. 

76 Brodsky, A.S. and Silver, P.A. (2000) Pre-mRNA processing factors are required 
for nuclear export. RNA, 6 (12), 1737-1749. 

77 Fusco, D., Accornero, N., Lavoie, B., Shenoy, S.M., Blanchard, J.-M., Singer, R.H., 
and Bertrand, E. (2003) Single mRNA molecules demonstrate probabilistic 
movement in living mammalian cells. Curr. Biol., 13 (2), 161-167. 

78 Golding, I. and Cox, E.C. (2004) RNA dynamics in live Escherichia coli cells. 
Proc. Natl. Acad. Sci. U.S.A., 101 (31), 11310-11315. 

79 Valencia-Burton, M., McCullough, R.M., Cantor, C.R., and Broude, N.E. (2007) 
RNA visualization in live bacterial cells using fluorescent protein 
complementation. Nat. Methods, 282 (5387), 296-298. 

80 Guo, S., Tschammer, N., Mohammed, S., and Guo, P. (2005) Specific delivery of 
therapeutic RNAs to cancer cells via the dimerization mechanism of phi29 
motor pRNA. Hum. Gene Ther., 16 (9), 1097-1109. 


81 Ponchon, L. and Dardel, F. (2007) Recombinant RNA technology: the tRNA 
scaffold. Nat. Methods, 4 (7), 571-576. 

82 Schifferer, M. and Griesbeck, O. (2009) Application of aptamers and 
autofluorescent proteins for RNA visualization. Integr. Biol., 1 (8), 499-505. 

83 Le, T.T., Harlepp, S., Guet, C.C., Dittmar, K., Emonet, T., Pan, T., and Cluzel, P. 
(2005) Real-time RNA profiling within a single bacterium. Proc. Natl. Acad. Sci. 
U.S.A., 102 (2), 9160-9164. 

84 Keiler, K.C. (2011) RNA localization in bacteria. Curr. Opin. Microbiol., 14 (2), 
155-159. 

85 Broude, N.E. (2011) Analysis of RNA localization and metabolism in single live 
bacterial cells: achievements and challenges. Mol. Microbiol., 80 (5), 1137-1147. 

86 Ozawa, T., Natori, Y., Sato, M., and Umezawa, Y. (2007) Imaging dynamics of 
endogenous mitochondrial RNA in single living cells. Nat. Methods, 4 (5), 
413-419. 

87 Yiu, H.-W., Demidov, V.V., Toran, P., Cantor, C.R., and Broude, N.E. (2011) RNA 
detection in live bacterial cells using fluorescent protein complementation 
triggered by interaction of two RNA aptamers with two RNA-binding peptides. 
Pharmaceuticals, 4 (3), 494-508. 

88 Valencia-Burton, M., Shah, A., Sutin, J., Borogovac, A., McCullough, R.M., 
Cantor, C.R., Meller, A., and Broude, N.E. (2009) Spatiotemporal patterns and 
transcription kinetics of induced RNA in single bacterial cells. Proc. Natl. Acad. 
Sci. U.S.A., 106 (38), 16399-16404. 

89 Agapakis, C.M., Boyle, P.M., and Silver, P.A. (2012) Natural strategies for the 
spatial optimization of metabolism in synthetic biology. Nat. Chem. Biol., 8 (6), 
527-535. 

90 Chen, A.H. and Silver, P.A. (2012) Designing biological compartmentalization. 
Trends Cell Biol., 22 (12), 662-670. 

91 Erkelenz, M., Kuo, C.-H., and Niemeyer, C.M. (2011) DNA-mediated assembly 
of cytochrome P450 BM3 subdomains. J. Am. Chem. Soc., 133 (40), 
16111-16118. 

92 Liu, M., Fu, J., Hejesen, C., Yang, Y., Woodbury, N.W., Gothelf, K., Liu, Y., and 
Yan, H. (2013) A DNA tweezer-actuated enzyme nanoreactor. Nat. Commun., 
4, 2127. 

93 Niemeyer, C.M., Koehler, J., and Wuerdemann, C. (2002) DNA-directed 
assembly of bienzymic complexes from in vivo biotinylated NAD(P)H:FMN 
oxidoreductase and luciferase. ChemBioChem, 3 (2), 242-245. 

94 You, M., Wang, R.-W., Zhang, X., Chen, Y., Wang, K., Peng, L., and Tan, W. 
(2011) Photon-regulated DNA-enzymatic nanostructures by molecular 
assembly. ACS Nano, 5 (12), 10090-10095. 

95 Conrado, R.J., Wu, G.C., Boock, J.T., Xu, H., Chen, S.Y., Lebar, T., Turnsek, J., 
Tomsic, N., Avbelj, M., Gaber, R., Koprivnjak, T., Mori, J., Glavnik, V., Vovk, I., 
Bencina, M., Hodnik, V., Anderluh, G., Dueber, J.E., Jerala, R., and DeLisa, M.P. 
(2012) DNA-guided assembly of biosynthetic pathways promotes improved 
catalytic efficiency. Nucleic Acids Res., 40 (4), 1879-1889. 

96 Moon, T.S., Dueber, J.E., Shiue, E., Prather, K.L.J., and Prather, K.L. (2010) Use of 
modular, synthetic scaffolds for improved production of glucaric acid in 
engineered E. coli. Metab. Eng., 12 (3), 298-305. 


97 Delebecque, C.J., Silver, P.A., and Lindner, A.B. (2012) Designing and using 
RNA scaffolds to assemble proteins in vivo. Nat. Protoc., 7 (10), 1797-1807. 

98 Ducat, D.C., Sachdeva, G., and Silver, P.A. (2011) Rewiring hydrogenase-dependent 
redox circuits in cyanobacteria. Proc. Natl. Acad. Sci. U.S.A., 108 
(10), 3941-3946. 

99 Schirmer, A., Rude, M.A., Li, X., Popova, E., and del Cardayre, S.B. (2010) 
Microbial biosynthesis of alkanes. Science, 329 (5991), 559-562. 

100 Torella, J.P., Ford, T.J., Kim, S.N., Chen, A.M., Way, J.C., and Silver, P.A. (2013) 
Tailored fatty acid synthesis via dynamic control of fatty acid elongation. 
Proc. Natl. Acad. Sci. U.S.A., 110 (28), 11290-11295. 

101 Barros, L.F. and Martinez, C. (2007) An enquiry into metabolite domains. 
Biophys. J., 92 (11), 3878-3884. 

102 Lee, H., DeLoache, W.C., and Dueber, J.E. (2012) Spatial organization of 
enzymes for metabolic engineering. Metab. Eng., 14 (3), 242-251. 

103 Fu, J., Liu, M., Liu, Y., Woodbury, N.W., and Yan, H. (2012) Interenzyme 
substrate diffusion for an enzyme cascade organized on spatially addressable 
DNA nanostructures. J. Am. Chem. Soc., 134 (12), 5516-5519. 

104 Ponchon, L., Catala, M., Seijo, B., El Khouri, M., Dardel, F., Nonin-Lecomte, S., 
and Tisne, C. (2013) Co-expression of RNA-protein complexes in Escherichia 
coli and applications to RNA biology. Nucleic Acids Res., 41 (15), e150. 

105 Gibson, D.G., Young, L., Chuang, R.-Y., Venter, J.C., Hutchison, C.A., and 
Smith, H.O. (2009) Enzymatic assembly of DNA molecules up to several 
hundred kilobases. Nat. Methods, 6 (5), 343-345. 

106 Kosuri, S., Eroshenko, N., LeProust, E.M., Super, M., Way, J., Li, J.B., and 
Church, G.M. (2010) Scalable gene synthesis by selective amplification of DNA 
pools from high-fidelity microchips. Nat. Biotechnol., 28 (12), 1295-1299. 

107 Liang, J.C., Bloom, R.J., and Smolke, C.D. (2011) Engineering biological 
systems with synthetic RNA molecules. Mol. Cell, 43 (6), 915-926. 

108 Wang, K., Neumann, H., Peak-Chew, S.Y., and Chin, J.W. (2007) Evolved 
orthogonal ribosomes enhance the efficiency of synthetic genetic code 
expansion. Nat. Biotechnol., 25 (7), 770-777. 

109 Neumann, H., Wang, K., Davis, L., Garcia-Alai, M., and Chin, J.W. (2010) 
Encoding multiple unnatural amino acids via evolution of a quadruplet-decoding 
ribosome. Nature, 464 (7287), 441-444. 

110 Mali, P., Yang, L., Esvelt, K.M., Aach, J., Guell, M., DiCarlo, J.E., Norville, J.E., 
and Church, G.M. (2013) RNA-guided human genome engineering via Cas9. 
Science, 339 (6121), 823-826. 

111 Sachdeva, G., Garg, A., Godding, D., Way, J., and Silver, P.A. (2014) In vivo 
co-localization of enzyme on RNA-scaffolds increases metabolic production in 
a geometrically dependent manner. Nucleic Acids Res. doi: 10.1093/nar/gku617. 

112 Douglas, S.M., Bachelet, I., and Church, G.M. (2012) A logic-gated nanorobot 
for targeted transport of molecular payloads. Science, 335 (6070), 831-834. 

113 Fu, J. and Yan, H. (2012) Controlled drug release by a nanorobot. Nat. 
Biotechnol., 30 (5), 407-408. 

114 Robinson-Mosher, A., Shinar, T., Silver, P.A., and Way, J. (2013) Dynamics 
simulations for engineering macromolecular interactions. Chaos, 23 (2), 
025110. 


277 


278 | 13 Synthetic RNA Scaffolds for Spatial Engineering in Cells 


115 


116 


117 


118 


119 


120 


Wu, C.-H., Lockett, M.R., and Smith, L.M. (2012) RNA-mediated gene 
assembly from DNA arrays. Angew. Chem. Int. Ed., 51 (19), 4628-4632. 

Lucks, J.B., Mortimer, S.A., Trapnell, C., Luo, S., Aviran, S., Schroth, G.P., 
Pachter, L., Doudna, J.A., and Arkin, A.P. (2011) Multiplexed RNA structure 
characterization with selective 2'-hydroxyl acylation analyzed by primer 
extension sequencing (SHAPE-Seq). Proc. Natl. Acad. Sci. U.S.A., 108 (27), 
11063-11068. 

Shu, X., Lev-Ram, V., Deerinck, T.J., Qi, Y., Ramko, E.B., Davidson, M.W., Jin, 
Y., Ellisman, M.H., and Tsien, R.Y. (2011) A genetically encoded tag for 
correlated light and electron microscopy of intact cells, tissues, and organisms. 
PLoS Biol., 9 (4), e1001041. 

Martell, J.D., Deerinck, T.J., Sancak, Y., Poulos, T.L., Mootha, V.K., Sosinsky, 
G.E., Ellisman, M.H., and Ting, A.Y. (2012) Engineered ascorbate peroxidase as 
a genetically encoded reporter for electron microscopy. Nat. Biotechnol., 1-9. 
Choi, H.M.T., Beck, V.A., and Pierce, N.A. (2014) Next-generation in situ 
hybridization chain reaction: higher gain, lower cost, greater durability. ACS 
Nano, 8 (5), 4284-4294. 

Jungmann, R., Avendafio, M.S., Woehrstein, J.B., Dai, M., Shih, W.M., and Yin, 
P. (2014) Multiplexed 3D cellular super-resolution imaging with DNA-PAINT 
and exchange-PAINT. Nat. Methods, 11 (3), 313-318. 


14


Sequestered: Design and Construction of Synthetic Organelles

Thawatchai Chaijarasphong¹ and David F. Savage²,³


¹ Mahidol University, Faculty of Science, Department of Biotechnology, Rama VI Rd., Bangkok 10400, Thailand
² University of California, Department of Molecular and Cell Biology, 2151 Berkeley Way, Berkeley, CA 94720, USA
³ University of California, Department of Chemistry, 2151 Berkeley Way, Berkeley, CA 94720, USA


14.1 Introduction


Spatial organization is a design principle of life. At the most basic level,
compartmentalization separates the living contents of an organism from the
nonliving extracellular milieu. Inside the cell, the sequestration of processes
into distinct organelles and spaces is a common strategy for enabling competing
pathways. Compartmentalization therefore allows for concurrent metabolic
processes that are thermodynamically out of equilibrium with each other. The
chemiosmotic proton-motive force is a classic example, in which protons are
pumped from the matrix of the mitochondrion into the intermembrane space, using
free energy derived from electron transfer [1]. By exquisitely regulating this
gradient, the cell can capture its stored energy to synthesize adenosine
triphosphate (ATP). Collapse of the gradient to equilibrium eliminates the
mitochondrion's ability to synthesize ATP and results in cell death.

From a biocatalysis point of view, compartmentalization creates a number of 
potential advantages for the engineer. First, it offers an additional way to regulate 
pathways [2]. Metabolites can be marked for specific processes in a regulated 
fashion, such as in the case of fatty acid oxidation and synthesis, which use 
orthogonal pools of fatty acyl-coenzyme A or fatty acyl-acyl carrier protein
(ACP), respectively. Enzymes can also be selectively regulated via localization, 
such as in the glycosome, a peroxisome-derived organelle found in protozoa [3]. 
As its name suggests, the glycosome sequesters the first seven enzymes of glyco- 
lysis into a separate compartment. Its function appears to be regulatory. The 
glycosomal enzymes do not possess typical allosteric regulation (e.g., feedback 
inhibition of phosphofructokinase), and it is thought that compartmentalization 
achieves the same effect [4].

Co-localization of a pathway also promotes substrate channeling of
intermediates between enzymatic steps to improve both kinetics and yield and
reduce host toxicity [5]. Channeling commonly occurs in multifunctional enzymes
where a labile or toxic molecule is passed from one active site to another via a
protein channel. Examples include tryptophan synthase (indole intermediate),
acetyl-CoA synthase/carbon monoxide dehydrogenase (carbon monoxide), and
carbamoyl phosphate synthase (ammonia) [6-8]. Similar mechanisms occur in
bacterial microcompartments (BMCs), large proteinaceous shells that encapsulate
short metabolic pathways. These will be discussed in greater detail later, but
briefly, several lines of evidence suggest that these protein complexes are able
to sequester/channel both volatile substrates (CO2, acetaldehyde) and those
potentially toxic (propanal) to the rest of the cell [9-11]. In a related
context, it is important to note that self-assembly of enzymes and pathways into
large complexes is more common than previously realized. It is perhaps an
inevitable outcome of the fact that metabolic enzymes are highly expressed and
allosterically regulated [12]. Thus, substrate channeling is often critical to
metabolic pathway function.

Recent synthetic biological efforts have leveraged these principles for improved
biocatalysis. The goal of metabolic engineering is to produce important
chemicals, such as pharmaceuticals, materials, and biofuels, from cheap and
sustainable biomass [13, 14]. Doing so requires high productivities and yields
for engineered pathways, but this optimization often runs counter to the growth
and fitness of the host organism. Drawing inspiration from nature, one promising
metabolic engineering strategy is to repurpose organelles or protein complexes
as cellular factories for improving the performance of engineered pathways, in
other words, to engineer synthetic organelles. In a striking example of this
strategy, Dueber and colleagues have engineered scaffold proteins from the yeast
mitogen-activated signaling cascade as an enzymatic assembly line to improve
production titers of the isoprenoid precursor pathway nearly 80-fold while
reducing intermediate toxicity [15]. Similarly, Sachdeva et al. improved
synthesis of pentadecane from fatty acyl-ACP by co-localizing fatty acyl-ACP
reductase and aldehyde-deformylating oxygenase to an RNA scaffold, providing a
strategy for optimizing microbial biofuel production [16].

Building upon the idea of enhancing pathway flux through co-localization,
various molecular chassis and metabolic engineering strategies have been
developed to facilitate catalysis. To this end, here we review recent advances
and open questions in the engineering and use of synthetic organelles for
bioengineering applications (Figure 14.1). As most research heretofore has
centered on metabolism, our focus is largely on metabolic engineering
applications. The physical composition of an organelle, whether it is made from
lipids or proteins, profoundly shapes potential uses, so our review is
conceptually broken down into these two areas. Finally, it should be noted that
the compartmentalization of engineered metabolism spans many orders of
magnitude, from metabolically engineered cocultures of microbes down to single
enzymes (Figure 14.1). We will focus on the middle regime, from protein
compartments to repurposing existing organelles, but direct the engaged reader
to previous reviews focused on the former [17] and latter [18, 19].


Figure 14.1 Possible strategies for engineering a synthetic organelle. Complexity of intra- and
intercellular spatial organization spans from enzymes with inherent substrate channeling to 
symbiotic cocultures. This review highlights work in the middle ground, from 
nanocompartments to repurposed organelles. 


14.2 On Organelles


Advances in imaging and comparative genomics have muddled the late
twentieth-century definition of an organelle as a specialized lipid-enclosed
compartment found only in eukaryotes [20, 21]. It is now clear that many
prokaryotes contain topologically distinct membrane compartments and that
proteinaceous BMCs, also found in prokaryotes, possess metabolic features
similar to those of complex structures such as mitochondria [9, 22]. In the
early years of light microscopy, beginning with Möbius in the late 1800s, many
cytoplasmic features including ribosomes, flagella, and the centriole were
labeled with the diminutive organelle. Given recent results and historical
ambiguity, we therefore propose to adopt a more relaxed definition in the
context of this review: an organelle is simply a physically delimited
compartment within the cell.

An alternative viewpoint, particularly for the synthetic biologist, is to ask what 
is required to repurpose an existing organelle or construct one de novo. In this 
light, four important intertwined, but distinct, themes emerge (Figure 14.2). The 
first is targeting. To accomplish orthogonal function in a specific compartment, 
it is essential to have selective targeting of biochemical activities (i.e., typically 
enzymes). Nature widely leverages the specificity inherent to protein-protein 
interactions through the use of signal sequences. Engineering an organelle 
requires extensive knowledge of targeting, specificity of this process, and ideally, 
how the stoichiometry of targeted components can be adjusted to control 
activity of individual proteins.

Figure 14.2 Four core design principles for a synthetic organelle.


Catalysis is often the motivating factor for organelle engineering, leading to 
two linked properties: compartment permeability and its inherent chemical
environment. Permeability is the selectivity of the surrounding membrane or 
protein shell that directly affects what can diffuse across or be transported in and 
out of the compartment. In lipid-based organelles, selectivity is modulated by 
the nature of the membrane lipid content and types and specificities of integral 
membrane transporters or channels. In proteinaceous organelles, it is simply a 
function of the shell’s diffusive permeability. In a related context, the chemical 
environment, set up by the interplay of both permeability and combined enzy- 
matic activity taking place within the compartment, will control the concentra- 
tions of potential substrates and products, as well as general properties such as 
pH [23]. These concentrations will directly control both the thermodynamic 
equilibrium of a particular process and its kinetics, profoundly shaping the cata- 
lytic potential of an organelle. 

Finally, it is important to have a working understanding of organelle biogenesis. 
Biogenesis is the process of organelle self-organization and controls organelle
shape, size, and copy number [24]. Repurposing efforts focused on existing lipid- 
based organelles have so far shied away from extensive remodeling, but it is logi- 
cal to assume future efforts will enable the complete refactoring of existing 
structures or even the de novo creation of novel compartments. Engineered bio- 
genesis has had more success in the protein-based space, as the genetic informa- 
tion required for synthesis is far less. BMCs contain roughly 10-15 proteins, and 
there are established systems for the transgenic expression of microcompart- 
ments in new organismal hosts. 

In understanding these four properties, we therefore seek a deep understanding
of organelle structure and function. Put another way, the synthetic
biology-minded goal of engineering novel organelles represents hypothesis
testing to an extreme: if you can't build it, you don't understand it. From a
historical perspective, it is important to realize that these are not new ideas.
We simply have better molecular tools. To end, we quote from F.H. Gaertner, who
posited similar ideas over three decades ago:


The degree to which the majority of the cytosolic enzyme systems may be 
organized, and the manner in which such organization would endow these 
systems with one or more of the unique catalytic properties, stand as open 
questions. In order to answer these questions fully, our ultimate challenge 
may be to take a cell apart and put it back together again. [25] 


14.3 Protein-Based Organelles 


We begin with protein complexes, which represent a modular route to synthetic 
organelle construction. Nature widely uses substrate channeling in enzymatic 
complexes [5, 26], but decoupling compartmentalization from inherent enzy- 
matic function is challenging. For example, the substrate-channeling tunnel of 
tryptophan synthase is structurally intertwined with the α and β subunits and
their active sites. Altering enzymatic function while maintaining channeling 
between active sites would require a tremendous protein engineering effort. A 
more sensible starting point is therefore an a priori functionally decoupled sys- 
tem, in which compartmentalization is a property distinct from enzymatic 
function. 


14.3.1 Bacterial Microcompartments 


BMCs are proteinaceous organelles that are functionally decoupled into shell 
proteins and cargo proteins (Figure 14.3a) [9, 27, 28]. The cargo proteins possess 
enzymatic function and generally constitute a small metabolic pathway of two to 
four reactions. The widespread prevalence and rich diversity of these organelles 
became evident in a recent bioinformatics study, which identified 23 types of 
BMCs in 23 phyla of bacteria [29-31, 134]. Functionally, BMCs can be grouped 
into two main categories: anabolic and catabolic microcompartments [11, 29]. 
The only known member of the anabolic group is the carboxysome, which performs
carbon dioxide fixation in photoautotrophic and chemoautotrophic bacteria
[32]. Catabolic BMCs (also called metabolosomes), as the name suggests, perform
various catabolic reactions that help break down nutrients. This class of
BMCs accounts for most of the diversity reported [17, 29], but only two members
have been extensively characterized: the propanediol-utilizing (PDU)
microcompartment [135, 136] and the ethanolamine-utilizing (EUT)
microcompartment [33, 137].

Despite this divergence, the three most-studied BMCs (the carboxysome, PDU, and
EUT) share a similar structural arrangement and mode of function
(Figure 14.3a). The shell is formed principally by a ~100-amino-acid α/β protein
possessing a canonical BMC domain (Pfam00936), which oligomerizes into a
homohexamer roughly 70 Å in diameter [34]. Subsequently, this hexamer
self-assembles into larger sheetlike structures that form the facets of the BMC
shell.

Figure 14.3 The structure and function of a bacterial microcompartment, the carboxysome.
(a) Structural model showing some of the structural components and enzymatic cargo. (b)
Schematic of the carbon-concentrating mechanism of carboxysome function. (c) Model of
shell permeability based on CcmK4 (PDB: 2A10).


Examples of these proteins are CcmK2 and CsoS1A from the carboxysome (CB), PduA
from the PDU, and EutM from the EUT [35]. There are roughly 4000-5000 copies of
the protein per shell. Additionally, there is a minor protein component of ~90
amino acids that does not contain the canonical BMC domain and instead
oligomerizes into a pentamer (Pfam03319), as determined via X-ray
crystallography [36]. Cryo-electron microscopy (EM) studies of purified BMCs
suggest that the overall structure possesses a roughly icosahedral form
[37, 38]. Topological constraints therefore dictate that the pentameric protein
forms the vertices of the icosahedron. There are 12 vertices in an icosahedron,
which places the stoichiometry of the pentamer-forming monomer at exactly 60
copies per shell. Biochemical evidence also suggests the pentamer is of very low
abundance in purified BMCs. Examples of this family include CcmL and CsoS4A from
the CB, PduN in the PDU, and EutN from the EUT. Intriguingly, although EutN
crystallizes as a hexamer, protease cleavage experiments suggest it is a
pentamer in solution [36, 39]. This heterogeneity in quaternary structure may
explain the somewhat irregular form of isolated PDUs and EUTs in comparison with
the more icosahedral-like CBs. The overall polyhedral structure therefore
consists of (roughly) 20 triangular facets built from thousands of shell
proteins. Depending on the BMC and preparation protocol, the overall structure
ranges from 80 to 400 nm in size. Finally, cargo proteins are targeted to the
lumen of the BMC through protein-protein interactions with the inside face of
the shell. Depending on the BMC, there are thousands of copies of protomers
(e.g., ~2000 RuBisCO (ribulose 1,5-bisphosphate carboxylase/oxygenase) monomers
in a CB) targeted to the lumen [37]. The specific mechanisms of targeting are
discussed in greater detail later.
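
The shell stoichiometry quoted above follows from simple counting, which can be checked with a short calculation. The sketch below is a back-of-envelope illustration only: it assumes an idealized icosahedron with one pentamer per vertex and hexamers spread evenly over the 20 facets, and the function names are hypothetical rather than taken from any published tool.

```python
# Back-of-envelope shell stoichiometry for an idealized icosahedral BMC.
# Assumption: one pentamer sits on each vertex, and the hexamer-forming
# protein is distributed evenly over the triangular facets.

ICOSAHEDRON_VERTICES = 12
ICOSAHEDRON_FACETS = 20


def pentamer_monomers(vertices: int = ICOSAHEDRON_VERTICES) -> int:
    """Monomer copies needed to place one pentamer on every vertex."""
    return vertices * 5


def hexamers_per_facet(hexamer_monomers: int,
                       facets: int = ICOSAHEDRON_FACETS) -> float:
    """Average number of hexamers per facet for a given monomer count."""
    return hexamer_monomers / 6 / facets


if __name__ == "__main__":
    print("pentamer monomers per shell:", pentamer_monomers())  # 60
    # The text quotes roughly 4000-5000 hexamer-forming monomers per shell.
    for copies in (4000, 5000):
        print(copies, "monomers ->",
              round(hexamers_per_facet(copies), 1), "hexamers per facet")
```

Real shells deviate from this idealization (isolated PDUs and EUTs are noted above to be irregular), so the per-facet numbers are order-of-magnitude guides at best.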

Before delving into the prospects of reengineering BMCs, it is important to
understand their function in the native context. The CB was the first BMC to be
discovered and characterized, and it remains the paradigm for BMC function
[40]. We give an overview of CB function here to highlight themes of BMC
function (Figure 14.3b). Organisms that assimilate carbon using the
Calvin-Benson cycle must compensate for the low affinity of RuBisCO for CO2 and
for its promiscuity: RuBisCO can also fix O2 in the same reaction at a cost to
the cell. To overcome these limitations, cyanobacteria and many chemoautotrophs
employ a carbon-concentrating mechanism, which consists of inorganic carbon
transporters to increase intracellular bicarbonate levels, and the CB to
facilitate carbon fixation [41, 42]. After bicarbonate is actively transported
into the cell, it passively crosses the CB shell (discussed later in the text)
and enters the CB lumen. The CB encapsulates two enzymes, carbonic anhydrase and
RuBisCO. Carbonic anhydrase converts bicarbonate into CO2 and OH−, and RuBisCO
fixes this CO2 onto ribulose 1,5-bisphosphate, which must also enter the lumen,
producing two molecules of 3-phosphoglycerate. 3-Phosphoglycerate then diffuses
out of the CB and enters the reductive phase of the Calvin-Benson cycle.
Although modeling indicates the major mechanism benefiting the reaction is an
increased local concentration of CO2 to improve the catalytic rate [23],
additional possible mechanisms include excluding the competing substrate O2 from
the lumen, improving CO2 channeling from carbonic anhydrase to RuBisCO via tight
clustering of the enzymes [43], and raising the local pH around RuBisCO to
increase its catalytic activity (Figure 14.3c). Most of the experimental
evidence for these hypotheses is indirect; for example, catalytically dead
carbonic anhydrase mutants require high CO2 concentrations to grow [30]. Further
physiological experiments will therefore be needed to describe the actual
mechanism(s) used by the CB to facilitate carbon fixation. Finally, it is
important to note that the CB comes in two forms, the so-called α and β types
[44]. They are differentiated by sequence in their shell and cargo proteins,
particularly carbonic anhydrase, and by their genomic organization. In general,
genes for α-CBs occur together in a single operon in the genome, while the β-CB
regulon is composed of genes spread across the genome. Despite these
evolutionary differences, their catalytic activity is the same and their
physiological role is assumed to be similar [45].
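
The argument that raising luminal CO2 and excluding O2 both favor carboxylation can be illustrated with the textbook rate law for an enzyme facing a competitive inhibitor. The sketch below is not the model of reference [23]; it is a minimal illustration in which O2 is treated as a competitive inhibitor of carboxylation, and the default constants (Kc, Ko) and the concentrations used are placeholder assumptions chosen only to show the trend.

```python
def carboxylation_fraction(co2_uM: float, o2_uM: float,
                           kc_uM: float = 200.0,
                           ko_uM: float = 1000.0) -> float:
    """Carboxylation rate as a fraction of Vmax, with O2 competing.

    v / Vmax = [CO2] / ([CO2] + Kc * (1 + [O2] / Ko))

    kc_uM and ko_uM are illustrative placeholders, not measured values.
    """
    if co2_uM < 0 or o2_uM < 0:
        raise ValueError("concentrations must be non-negative")
    # Competitive inhibition raises the apparent Michaelis constant for CO2.
    apparent_kc = kc_uM * (1.0 + o2_uM / ko_uM)
    return co2_uM / (co2_uM + apparent_kc)


if __name__ == "__main__":
    o2 = 250.0  # assumed dissolved O2, micromolar
    for co2 in (15.0, 150.0, 1500.0):  # hypothetical CO2 levels, micromolar
        frac = carboxylation_fraction(co2, o2)
        print(f"CO2 = {co2:7.1f} uM, O2 = {o2:.0f} uM -> v/Vmax = {frac:.3f}")
    # Removing O2 entirely at low CO2 gives only a modest gain by comparison.
    print(f"no O2, CO2 = 15 uM -> v/Vmax = {carboxylation_fraction(15.0, 0.0):.3f}")
```

Under these assumed constants, concentrating CO2 changes the rate far more than excluding O2 does, which is at least consistent with the modeling result cited above; different parameter choices would shift the balance.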


14.3.1.1 Targeting 

Although the shell is the defining feature of BMCs, it is the targeting of cargo
that endows function. Targeting is mediated via protein-protein interactions and
probably occurs concurrently with assembly of the shell itself. The first direct
evidence for a shell-interacting motif was found in the PDU [46]. In this case,
bioinformatic analysis of the propionaldehyde dehydrogenase (PduP) revealed an
N-terminal extension found only in PDU-containing organisms. Alanine scanning,
among other biochemical experiments, has shown that this putative α-helical
signal sequence can interact with the hexameric shell proteins (PduA, PduJ, and
PduK), and inclusion of this sequence allows encapsulation of foreign cargo
[47, 48]. In addition, subsequent studies revealed that the short N-terminal
extension of the medium subunit (PduD) of adenosylcobalamin diol dehydratase
(PduCDE) and an uncharacterized protein, PduV, can target their respective cargo
to the PDU [49, 50]. Besides the PDU, other catabolic BMCs, including the EUT
and a glycyl radical-based propanediol utilization (GRP) microcompartment, use
signal peptides to encapsulate their respective cargo [51-53]. Strikingly, these
targeting peptides have been shown to enable targeting of green fluorescent
protein (GFP) to the PDU, suggesting that the targeting specificity for these
BMCs is not stringent and may be determined by the composition rather than the
sequence of the targeting peptides [53]. This relaxed specificity allowed for
the de novo construction of a synthetic signal peptide for PDU targeting. With
the growing repertoire of natural and synthetic signal peptides, it may soon be
possible to encapsulate multiple enzymes in a BMC to constitute a longer
metabolic pathway. While the mechanism of encapsulation via signal peptides is
not completely understood, interesting applications have already emerged,
including the construction of an ethanol nanoreactor by encapsulating pyruvate
decarboxylase and alcohol dehydrogenase in the PDU [47] and
compartmentalization of polyphosphate kinase (PPK1) in the PDU to enhance the
conversion of biological phosphates to cellular polyphosphate [54].

In contrast to catabolic BMCs, the targeting strategy used by carboxysomes is
less well characterized, and most of our understanding comes from β-CBs.
Pull-down and yeast two-hybrid experiments probing components of the
cyanobacterial β-CB revealed that RuBisCO is anchored to the shell via specific
interactions with the protein CcmM, which acts as an intermediate bridge between
enzymatic cargo and the shell. CcmM also possesses a nonfunctional carbonic
anhydrase-like domain and recruits the functional carbonic anhydrase, CcaA,
forming a functional carbon-fixing complex [37, 38]. More recently, the CB
protein CcmN was found to be essential for shell recruitment during carboxysome
assembly, and its deletion resulted in a large shell-less RuBisCO aggregate
[55]. A C-terminal extension of this protein appears to interact with the major
shell hexamer CcmK2 and is sufficient for targeting GFP into CBs [56].
Therefore, CcmN may be the actual mediator between the shell and the
CcmM/CcaA/RuBisCO complex discussed earlier. In α-CBs, homologs of CcmM and CcmN
are not present, but a poorly characterized protein called CsoS2 may perform an
analogous function. CsoS2 is an intrinsically disordered protein with many
amino acid repeats [57, 58]. These properties are often associated with proteins
that function as “assembly coordinators” for large complexes, thus providing an
important clue about the function of CsoS2 [59]. As evidence of the necessity of
CsoS2 to α-CB assembly, deletion of CsoS2 in Halothiobacillus neapolitanus
abolishes carboxysome formation and renders the organism high-CO2 requiring
(HCR) [57]. Interestingly, it was shown that one csoS2 coding sequence produces
two CsoS2 isoforms via a co-translational mechanism [58], reminiscent of CcmM
and CcmN in β-CBs. If the CsoS2 isoforms are indeed functionally analogous to
CcmM and CcmN, understanding the way they interact with other carboxysomal
proteins may shed light on how cargo in the α-CB is encapsulated. Additional
work is required to understand the true role of CsoS2 and to develop a strategy
to target foreign cargo to α-CBs.

Despite these advances, several outstanding questions in carboxysome targeting
remain. Firstly, the stoichiometry of cargo proteins can vary more than 10-fold
(there are roughly 100 protomers of carbonic anhydrase and 2000 protomers of
RuBisCO), yet there is no mechanistic explanation for how this can be programmed
through protein-protein interactions alone. This will be critical to understand
as future engineers attempt to balance flux through multistep enzymatic
pathways. Secondly, little is known about protein targeting in the α-CB. The
α-CB from H. neapolitanus is a structurally robust BMC that can assemble without
cargo and be transgenically expressed in Escherichia coli, making it an
intriguing chassis for synthetic biological purposes [60, 61]. Making this a
reality, however, will ultimately require a complete biophysical understanding
of the targeting motifs and mechanisms.


14.3.1.2 Permeability 

The structure of the shell proteins is thought to control permeability of the
BMC (Figure 14.3c). X-ray crystal structures of various hexameric and pentameric
shell proteins have revealed pores along the major axis of symmetry that, in
principle, would facilitate passive diffusion of substrates and products. The
pores are generally small (4-6 Å in diameter), implying specificity [62]. In the
case of the CB, positively charged residues are found at the narrowest area of
the pore, suggesting a mechanism that selects for negatively charged molecules,
such as the substrates/products bicarbonate, ribulose 1,5-bisphosphate, and
3-phosphoglycerate, and against molecules without a dipole such as O2. Although
there are few permeability measurements to support these hypotheses,
physiological data clearly imply that there is minimal photorespiration
(fixation of O2) when RuBisCO is inside the CB, suggesting O2 exclusion may be
one effect of encapsulating RuBisCO [63, 64]. In addition, csoS4-disrupted
H. neapolitanus have an HCR phenotype, and their CBs leak CO2 as interpreted by
kinetic experiments [65]. Thus, a tight BMC shell appears to act as a gas
barrier to exclude O2 and sequester CO2. This theme is seen in other BMCs as
well. For instance, Salmonella enterica mutants that cannot produce PDU
accumulate 10-fold increased levels of propanal in the cytosol [11]. Aldehydes,
as nonspecific cross-linkers, damage DNA, and the 10-fold increase in propanal
levels proved to be highly mutagenic. Similarly, alteration of pore-lining
residues in PduA resulted in propanal leakage, reduced 1,2-propanediol influx,
and increased glycerol influx, further substantiating the role of the PDU shell
as a selective diffusion barrier [66]. In the case of the EUT, shell mutants
also leak their intermediate, acetaldehyde, but here, physiological data support
the hypothesis that the sequestration acts to stop the loss of a volatile
intermediate out of the pathway. Thus, BMC shells can achieve many catalytic
goals (enhancing pathway specificity and yield while reducing toxicity) by
tuning their permeability.

Recent X-ray structures highlight an expanded toolkit for altering shell
permeability. The relatively small size of most pores (4-6 Å) is at odds with
the required permeability for larger substrates such as ribulose
1,5-bisphosphate or the cofactors coenzyme A and NAD used in the PDU and EUT.
Kerfeld and colleagues have solved the structure of the CsoS1D protein from the
α-CB, which possesses a tandem BMC domain and forms a homotrimeric pseudohexamer
with a much larger pore of 14 Å [67]. Although this protein is of low abundance
and only recently detected in purified CBs, it may play an important role in
allowing larger molecules passage into and out of the CB [61, 68]. Even more
intriguingly, CsoS1D also crystallized in alternate conformations with both an
open and closed pore. In follow-up work, Kerfeld et al. solved the structure of
the orthologous tandem repeat protein from the β-CB and again observed open and
closed forms [69]. More recently, EutL has been shown to have negative
allosteric regulation for pore opening by ethanolamine, and disulfide bonding
may play a role in modulating the binding affinity toward ethanolamine [70].
These findings raise the possibility of posttranslational regulation of BMC
permeability.

Catalyzing redox reactions is a critical component of PDU and EUT activity,
and recent results also highlight the role of the shell in these processes.
Overexpression of the Citrobacter freundii PDU and its components in E. coli
led to the surprising realization that the shell protein PduT contains an Fe-S
cluster on its major symmetry axis [71, 72]. This was confirmed via electron
paramagnetic resonance and X-ray crystallography. The midpoint potential was
measured at +0.099 V, suggesting the cluster may help recycle NADH, produced
during the oxidation of propionaldehyde to propionyl-CoA, back to NAD+.
Similarly, the shell protein GrpU of the GRP microcompartment also coordinates
an Fe-S cluster [73]. While further experimental validation is required to
demonstrate that these shells can truly participate in a redox reaction, it does
underscore the potential for catalytic flexibility among BMCs.
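
To see why a midpoint potential of +0.099 V is thermodynamically compatible with NADH recycling, one can compare it with the NAD+/NADH couple using the standard relation ΔG°' = −nFΔE°'. The sketch below is a rough estimate only. It assumes the commonly quoted textbook value of about −0.320 V for NAD+/NADH at pH 7 (not stated in the text), a two-electron transfer, and standard conditions; it ignores that Fe-S clusters typically carry one electron at a time and that real concentrations shift the result.

```python
FARADAY_C_PER_MOL = 96485.332  # Faraday constant

# +0.099 V is the PduT midpoint potential quoted in the text.
E_PDUT_CLUSTER_V = 0.099
# Assumption: textbook standard potential of the NAD+/NADH couple at pH 7.
E_NAD_COUPLE_V = -0.320


def delta_g_kj_per_mol(e_acceptor_v: float, e_donor_v: float,
                       n_electrons: int = 2) -> float:
    """Standard free energy change for electron transfer donor -> acceptor."""
    delta_e = e_acceptor_v - e_donor_v
    return -n_electrons * FARADAY_C_PER_MOL * delta_e / 1000.0


if __name__ == "__main__":
    dg = delta_g_kj_per_mol(E_PDUT_CLUSTER_V, E_NAD_COUPLE_V)
    print(f"Delta E = {E_PDUT_CLUSTER_V - E_NAD_COUPLE_V:.3f} V")
    print(f"Delta G (standard, 2 e-) = {dg:.1f} kJ/mol")
```

A negative value under these assumptions indicates only that electron transfer from NADH to the cluster would be favorable; as the text notes, whether the shell actually performs this reaction remains to be shown.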


14.3.1.3 Chemical Environment 

A related property is chemical environment, including redox state, pH,
intermediate concentrations, and cofactor status. This results from the
interplay of shell permeability and enzymatic activity in the lumen, creating
steady-state concentrations of molecular species different from those in the
cytosol. For example, a recent mathematical model predicts that a relatively
acidic carboxysome will exhibit higher equilibrium CO2 concentration and, in
turn, a higher degree of RuBisCO saturation [23]. This finding raises the
possibility that the actual carboxysome may similarly be acidic in order to
achieve maximum catalytic efficiency. Preliminary biochemical analyses of
carbonic anhydrases from some β-CBs also suggest that the lumen environment may
be oxidative, promote disulfide bond formation, and be a means of controlling
protein activity [74, 75, 138]. Interestingly, CsoS2, the putative scaffolding
protein of the α-CB, contains many cysteine residues, half of which are
conserved across amino acid repeats. The abundance of cysteines may imply
CsoS2's participation in a disulfide-bonding network within the carboxysomal
lumen, which, if true, would explain the exceptional robustness of the α-CB.
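
The link between luminal pH and equilibrium CO2 can be illustrated with the Henderson-Hasselbalch relation for the CO2/bicarbonate pair. The sketch below is not the model of reference [23]; it is a simple equilibrium estimate that assumes an apparent pKa of about 6.35 (a commonly used value for dilute aqueous solution, introduced here as an assumption) and neglects carbonate and any kinetic or transport effects.

```python
def co2_fraction(ph: float, pka: float = 6.35) -> float:
    """Equilibrium fraction of (CO2 + HCO3-) present as dissolved CO2.

    From Henderson-Hasselbalch: [HCO3-] / [CO2] = 10 ** (pH - pKa).
    The default pKa is an assumed apparent value, not taken from the text.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    for ph in (6.5, 7.0, 7.5, 8.0):  # hypothetical luminal pH values
        print(f"pH {ph:.1f}: {100 * co2_fraction(ph):5.1f}% of the pool is CO2")
```

At a fixed total inorganic carbon pool, each unit drop in pH raises the CO2 share substantially, which is the qualitative trend behind the prediction that a more acidic lumen would better saturate RuBisCO.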

The chemical environment is likely more extreme in the PDU and EUT. As 
described earlier, one explanation for PDU/EUT function is to sequester the 
buildup of toxic aldehyde intermediates away from the cytoplasm. S. enterica 
mutants with disrupted PDU shells accumulate 10x higher levels of cytosolic 
propionaldehyde (~15 mM), suggesting that luminal aldehyde concentrations are 
extremely high [11]. Interestingly, there is little information on the effect such 
harsh conditions have on protein activity within the BMCs. Many metabolic 
pathway intermediates, such as the aldehydes present in the PDU, EUT, and 
many candidate biofuel pathways, will inevitably cause protein misfolding and 
inactivate individual enzymes in the complex. A key open biological question is 
therefore how the proteostasis of BMCs and their enzymatic content is regu- 
lated. A single BMC is on the order of 0.2% of total cellular protein (estimated 
from CB mass of ~250 MDa [75] and E. coli protein content from BioNumber 
104879 [76]) and represents a tremendous investment for the cell. It remains to 
be seen whether BMCs are surveilled via the cell's proteostatic chaperones and, 
if so, whether this entire “costly” complex is turned over at the level of single 
inactive subunits or all at once. 
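
The 0.2% estimate can be checked with a short back-calculation. The sketch below uses only the two figures quoted above (a CB mass of ~250 MDa and a fraction of 0.2%) and derives the total protein mass per cell that they imply. That per-cell value is a derived number for illustration, not the figure recorded in the BioNumbers entry.

```python
DALTON_IN_GRAMS = 1.66053907e-24  # mass of one dalton in grams


def implied_total_protein_da(complex_mass_da: float, fraction: float) -> float:
    """Total cellular protein mass (Da) implied by one complex being `fraction` of it."""
    return complex_mass_da / fraction


cb_mass_da = 250e6  # ~250 MDa carboxysome, as quoted in the text
fraction = 0.002    # 0.2% of total cellular protein, as quoted in the text

total_da = implied_total_protein_da(cb_mass_da, fraction)
total_pg = total_da * DALTON_IN_GRAMS * 1e12  # grams -> picograms

print(f"Implied total protein: {total_da:.3g} Da (about {total_pg:.2f} pg per cell)")
```

The implied total, a few tenths of a picogram of protein per cell, is of the order usually quoted for E. coli, so the two numbers in the text are mutually consistent.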


14.3.1.4 Biogenesis 

The critical information for biogenesis is an understanding of the genes and 
expression levels that are necessary and sufficient for BMC self-assembly and 
function. Early genetic, cloning, and sequencing efforts revealed that BMC 
genes are often co-localized in operons but that the degree of 
co-localization varies with each BMC. For example, α-CB genes cluster together as 
a single regulon in the genome of H. neapolitanus, while the β-CB regulon found 
in many cyanobacterial strains is composed of five operons spread across the 
genome (Figure 14.4). For this reason, successful efforts at reconstituting and 
transgenically expressing fully functional BMCs in a heterologous host have 
focused on those where genes are co-localized in a single genomic island. For 
example, screening of a C. freundii genomic DNA library identified a cosmid 
capable of endowing E. coli with the ability to metabolize 1,2-propanediol [71]. 
Sequencing of this cosmid and further molecular biology to narrow down the 
candidate genes identified a minimal subset important for the heterologous 
production of PDUs [50]. This has proven an important tool for studying the role 
of each gene in defining PDU structure and for identifying signal sequences. 
A similar approach was successful for the α-CB and EUT. Expression of the 
10 genes from H. neapolitanus in E. coli led to fully formed CBs with a morphology 
nearly identical to those from the native host [61]. Likewise, expression of 17 
genes from S. enterica can also produce EUTs in E. coli [51, 137]. 

[Figure 14.4 diagram: gene organization of the S. elongatus PCC7942 β-CB regulon (ccmK2, ccmL, ccmM, ccmN, ccmO; rbcL, rbcS; ccmP; ccmK3, ccmK4; ccaA) and the H. neapolitanus α-CB operon (rbcL, rbcS, csoS2, csoS3, csoS4A, csoS4B, csoS1C, csoS1A, csoS1B, csoS1D), with genes labeled as shell, shell-associated, RuBisCO, or carbonic anhydrase.] 

Although these early attempts have defined genetic sufficiency, more work will 
be required to define essential, or necessary, elements. The ultimate goal, of 
course, would be to have a minimal system for expressing empty protein shells 
and targeting novel proteins to the lumen. One open question is the role of each 
shell protein and how many different genes are required to synthesize well- 
formed polyhedral shells. Most regulons possess multiple copies of genes for the 
hexameric and pentameric protomers. Whether this is a gene dosage mechanism 
for high protein expression or there is a functional difference between paralogs 
remains to be seen. It should be noted that the function of shell paralogs may be 
determined by their genomic position. This issue was brought to attention by 
Chowdhury and colleagues, who demonstrated that PduJ is permeable to 1,2- 
propanediol only when it is expressed from the pduA locus [77]. It is unclear why 
such a location effect exists, although it is thought that nascent PduJ translated 
from different genomic regions may encounter different sets of binding partners. 
Following from this observation, it may be possible to alter permeability of a shell 
protein by changing its gene location in lieu of the labor-intensive site-directed 
mutagenesis. 

Another factor related to biogenesis that will need to be clarified is the inherent 
stability of BMCs. It is known that CBs have a more icosahedral shape [78] 
and that α-CBs, in particular, are robust enough to be isolated from cells in a 
near-pristine form. Likewise, transgenic α-CBs display a somewhat native-like 
structure, suggesting that either the transcriptional and translational regulation 
of the operon sequence “ports” over better in E. coli or that the protein-protein 
interactions of α-CB self-assembly are inherently more robust. Future work will 
be necessary to clarify to what extent this hypothesis is true and whether it holds 
if BMCs are transgenically expressed in higher organisms such as yeast and 
plants. In fact, Lin and colleagues have already made the first attempt to produce 
carboxysomes in plants by expressing CcmM, CcmN, and three shell proteins 
(CcmK2, CcmL, and CcmO) from the β-CB in chloroplasts of Nicotiana benthamiana, 
but the resulting empty compartments were irregularly shaped [79]. It 
would be of special interest to determine whether a similar experiment with an 
α-CB would result in more morphologically normal particles. 


14.3.2 Alternative Protein Organelles: A Minimal System 


There are also several other self-assembling protein complexes that, in principle, 
could be adapted to function as an organelle. These include viral particles, large 
enzyme complexes such as lumazine synthase, the ribonucleoprotein vault com- 
plex, and the icosahedral encapsulin complex, all of which have been studied to 
some extent in attempts to engineer novel materials for both in vivo and ex vivo 
applications [80-83]. Since it is not possible to cover all such applications in an 
appreciable depth here, we refer the reader to previous reviews [84, 85] and 
instead focus on one particular complex — encapsulin — that is a minimal alterna- 
tive to the more complex BMCs discussed earlier (Figure 14.5). 
Figure 14.5 Model of an encapsulin. (a) Genomic organization in T. maritima, highlighting the signal 
sequence of the cargo protein. (b) Structural model based on the encapsulin X-ray structure (PDB: 
3DKT) and ferritin-like protein (PDB: 3HL1). 


Encapsulins (also called nanocompartments) are a family of poorly characterized 
proteins that have the defining feature of assembling into 20-30 nm 
icosahedral complexes (Figure 14.5). The founding member, Linocin M18 from 
Brevibacterium linens, was discovered as a secreted protein with bactericidal 
activity, but recent results question this biological function [83, 86]. Since then, 
the number of predicted encapsulins has increased dramatically, with the latest 
bioinformatics study reporting over 900 putative encapsulins across 15 bacte- 
rial and 2 archaeal phyla [87]. Encapsulins, like BMCs, also appear to be diverse, 
with four families of capsids and seven classes of associated cargo [88]. Despite 
the diversity, only a small number of encapsulins have been biochemically 
characterized, including those from Thermotoga maritima, Pyrococcus furio- 
sus, Mycobacterium tuberculosis, Myxococcus xanthus, and Rhodococcus jostii 
RHA1. 

The X-ray crystal structures of three different encapsulins from P. furiosus, 
T. maritima, and most recently M. xanthus have been determined, clarifying 
many open structural and functional questions [83, 89] (Figure 14.5b). The 
structural shell is formed from a single protomeric protein that self-assembles 
into an icosahedral shell about 2 nm thick. In the Pyrococcus and Myxococcus 
variants, 180 protomers assemble into a structure 30 nm in diameter, while in 
the Thermotoga variant 60 protomers form a 20 nm structure, suggesting significant 
structural heterogeneity can exist between encapsulins. As in BMCs, 
there are pores parallel to protomer symmetry axes. There are three distinct 
classes of pores, each possessing a diameter of about 5 Å, located at the interface 
between two adjacent protomers, at sites of fivefold symmetry, and at sites of 
threefold symmetry. While the first two classes show interspecies conservation 
of the chemical property of the pore-lining residues, the same is not true for the 
threefold axis pores: they are positively charged in the Thermotoga encapsulin 
while uncharged in many other classes. The explanation for this divergence is 
as yet unknown, although it may reflect the nature of the small molecules that must 
traverse the shell. 
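
The two protomer counts are consistent with the Caspar-Klug description of icosahedral shells, in which a shell of triangulation number T is built from 60T subunits and T takes values of the form h² + hk + k². The text does not state triangulation numbers; assigning T = 1 to the 60-protomer shell and T = 3 to the 180-protomer shell is an inference from the counts. The function names below are illustrative only.

```python
def allowed_t_numbers(limit: int) -> list[int]:
    """Caspar-Klug triangulation numbers T = h^2 + h*k + k^2 that are <= limit."""
    values = {h * h + h * k + k * k
              for h in range(limit + 1)
              for k in range(limit + 1)}
    return sorted(t for t in values if 0 < t <= limit)


def t_number_from_protomers(n_protomers: int) -> int:
    """Return T for an icosahedral shell of n_protomers subunits (n = 60 * T)."""
    if n_protomers <= 0 or n_protomers % 60 != 0:
        raise ValueError("protomer count must be a positive multiple of 60")
    t = n_protomers // 60
    if t not in allowed_t_numbers(t):
        raise ValueError(f"T = {t} is not an allowed triangulation number")
    return t


# Protomer counts as quoted in the text.
for name, n in (("T. maritima", 60), ("P. furiosus / M. xanthus", 180)):
    print(f"{name}: {n} protomers -> T = {t_number_from_protomers(n)}")
```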

This study also led to the identification of a putative signal sequence for encap- 
sulins. Bioinformatic analysis has revealed that two classes of enzymes, peroxi- 
dases and ferritin-like proteins, preferentially cluster in minimal operons 
adjacent to the shell-forming encapsulin gene. Serendipitously, there was addi- 
tional electron density in the Thermotoga structure abutting the inner face of the 
encapsulin shell. This density was of sufficient signal to identify a primary pep- 
tide sequence, which matched the C-terminus of the adjacent ferritin-like gene 
in the operon, establishing the link between the gene cluster and protein struc- 
ture. Deletion of the C-terminal region also disrupted targeting of the enzyme to 
the lumen, confirming this sequence is essential for targeting [83]. By employing 
this targeting sequence, many studies have reported successful targeting of 
heterologous cargo into encapsulins [90-92]. 

Encapsulins therefore have many advantages as potential synthetic organelles. 
They are in many ways a minimal version of BMCs. They assemble from a single 
shell protein into a compartment possessing about 1/100 the volume. This 
genetic simplicity likely ensures porting structures between organisms will be 
easier than for BMCs (Figure 14.5a). Preliminary experiments agree with this 
hypothesis — encapsulins from many organisms including B. linens, T. maritima, 
M. xanthus, and M. tuberculosis can be expressed heterologously in E. coli [87]. 
As an additional advantage, encapsulins commonly display exceptional resist- 
ance to temperature, pH, denaturant, proteases, and mechanical compression 
[83, 90, 91, 93, 94]. Therefore, they may serve as appealing alternatives for appli- 
cations that demand extreme conditions incompatible with other biological 
compartments. However, it appears the “addressability” (the number of proteins 
that can be targeted to the lumen) will be limited to one or two. For example, 
Snijder and colleagues employed native mass spectrometry to show that one 
B. linens encapsulin precisely packages one hexamer of peroxidases, 
suggesting that the limited capacity is likely a valid concern [94]. In this vein, we 
imagine that very short pathways, that is, two steps with a single toxic intermedi- 
ate, would make excellent candidates for encapsulation. Future work will also be 
required to understand and engineer shell permeability. 
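
To put the "about 1/100 the volume" comparison in terms of size, the sketch below treats both compartments as spheres (a simplifying assumption; the real shells are polyhedral) and asks what BMC diameter a 100-fold volume difference implies for the 20-30 nm encapsulins described above. The BMC diameters it prints are derived from that ratio, not taken from the text.

```python
def implied_diameter_nm(small_diameter_nm: float, volume_ratio: float) -> float:
    """Diameter of a sphere with `volume_ratio` times the volume of a smaller sphere.

    Volume scales with the cube of diameter, so diameter scales with the cube root.
    """
    return small_diameter_nm * volume_ratio ** (1.0 / 3.0)


for d in (20.0, 30.0):  # encapsulin diameters quoted in the text
    bmc = implied_diameter_nm(d, 100.0)
    print(f"{d:.0f} nm encapsulin -> BMC of roughly {bmc:.0f} nm")
```

A 100-fold difference in volume therefore corresponds to less than a fivefold difference in diameter.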


14.4 Lipid-Based Organelles 


The alternative to protein-based complexes is to leverage the natural organization 
of metabolism found in eukaryotes: membranous organelles. This makes 
practical sense as many key pathways of catabolism and anabolism are segregated 
at the organelle level, as discussed in the following text. In addition, much of our 
biological understanding of these processes comes from the yeast Saccharomyces 
cerevisiae, which is arguably the most important organism for metabolic engi- 
neering (Figure 14.6). Thus, there is already a working understanding of organelle 
targeting, permeability, chemical environment, and biogenesis. 
Figure 14.6 Schematic of potential synthetic organelles in the budding yeast. 


14.4.1 Repurposing Existing Organelles 


14.4.1.1 The Mitochondrion 

The mitochondrion is the site of oxidative metabolism within the eukaryotic cell, 
facilitating both the citric acid cycle and β-oxidation of fatty acids, and is involved 
in numerous critical cell processes, including apoptosis [95]. Its function revolves 
around metabolism, and it possesses a singular chemical environment — relatively 
high pH (~8), low oxygen concentration, and a reducing redox environment [96]. 
Besides its commonly associated pathways, the mitochondrion also assists in sev- 
eral other biosynthetic pathways including iron-sulfur cluster biogenesis, heme 
biosynthesis, and, surprisingly, type II fatty acid synthesis [95, 97]. An understanding 
of mitochondrial biogenesis is still a work in progress, but its central role in 
metabolic diseases has led to a new appreciation of mitochondrial biology. 
Proteomics has revealed the parts list of mitochondrial components and putative 
pathways and also led to a deeper understanding of relevant synthetic biological 
issues such as protein targeting [98]. 

Recently, the mitochondrion’s unique catalytic potential has been leveraged for 
metabolic engineering approaches. Farnesyl diphosphate (FDP) is a 15-carbon 
metabolic intermediate in the isoprenoid pathway. It is synthesized from the two 
isomers: isopentenyl pyrophosphate (IPP) and dimethylallyl pyrophosphate 
(DMAPP). Once synthesized, FDP can be processed by so-called sesquiterpene 
synthases into numerous products including molecules that are potential biofu- 
els and pharmaceuticals [99]. Farhi and colleagues hypothesized that since FDP 
stands at the intersection of isoprenoid biosynthesis, compartmentalization of 
its terminal reactions may enhance production [100]. This hypothesis was cor- 
rect, and targeting of a sesquiterpene synthase to the mitochondria using the 
known N-terminal targeting sequence of COX4 [101] led to a 3x increase in the 
final product, valencene [100]. Mitochondrial targeting of additional steps to 
produce the key intermediate, FDP, led to an additional twofold increase in final 
titers. Interestingly, despite these results, it is not known whether targeting is 
successful due to higher levels of FDP in the mitochondrion or whether sesquiterpene 
synthase is simply more active in the mitochondrial matrix. Circumstantial 
evidence from plants suggests that the mitochondrial matrix contains both IPP 
and DMAPP, agreeing with the former and alluding to a more complicated biosynthetic 
picture of mitochondrial function [102]. 

This approach has also been used for the production of another class of 
important chemicals, higher alcohols, which are potential gasoline replace- 
ments. Higher alcohols (aka fusel alcohols), such as isobutanol, are biosyn- 
thetically produced from the catabolism of amino acids via the Ehrlich pathway 
[103]. Interestingly, while the initial biosynthesis of amino acids occurs in 
mitochondria, the final Ehrlich decarboxylation and dehydrogenase reactions 
occur in the cytosol. Avalos and colleagues hypothesized that co-localizing the 
entire isobutanol pathway (derived from valine) could result in 
improved flux between enzymes and increased production [104]. This was 
indeed the case, and a complete mitochondrial-localized pathway resulted in a 
260% increase in production. Interestingly, control experiments found that 
co-localization of the same enzymes to the cytoplasm improves yields by only 
10%, suggesting the mitochondrion possesses an inherent biosynthetic 
capability. Indeed, the mitochondrial targeting system has also been used to 
optimize the production of acetoin [105] and fumarate [106] in yeasts and 
artemisinin in plants [107]. 
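
The improvements in this section are reported in mixed units (fold changes and percent increases), which makes them awkward to compare directly. The sketch below puts them on a common fold-over-baseline scale. It assumes that "a 3x increase" means three times the baseline titer, that the "additional twofold increase" multiplies that result, and that "a 260% increase" means baseline plus 260%. These readings are interpretations and are not stated explicitly in the text.

```python
def percent_increase_to_fold(percent: float) -> float:
    """Convert a percent increase over baseline into a fold-over-baseline value."""
    return 1.0 + percent / 100.0


def cumulative_fold(*folds: float) -> float:
    """Multiply successive fold improvements into one overall fold change."""
    result = 1.0
    for fold in folds:
        result *= fold
    return result


valencene_total = cumulative_fold(3.0, 2.0)        # 3x, then an additional twofold
isobutanol_mito = percent_increase_to_fold(260.0)  # mitochondrial pathway
isobutanol_cyto = percent_increase_to_fold(10.0)   # cytoplasmic co-localization

print(f"Valencene, both targeting steps: {valencene_total:.1f}x baseline")
print(f"Isobutanol, mitochondrial: {isobutanol_mito:.1f}x baseline")
print(f"Isobutanol, cytoplasmic: {isobutanol_cyto:.1f}x baseline")
```

On this scale the gap between mitochondrial and cytoplasmic co-localization of the isobutanol pathway is easier to see than in the mixed units of the original reports.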


14.4.1.2 The Vacuole 

The vacuole is the central degradative structure in fungi, such as S. cerevisiae, 
and is roughly the functional equivalent of the lysosome in mammals. It main- 
tains a low pH and possesses numerous hydrolytic enzymes involved in catabolic 
processes [108]. These properties led to the classic notion of the vacuole simply 
as the cell’s “trash can”. However, recent evidence suggests that the vacuole is a 
highly regulated structure, which carefully maintains stores of specific free sug- 
ars and amino acids, and is critical to cellular pH homeostasis, mitochondrial 
function, and replicative life span in yeast [109, 110]. Much of the specificity in 
this process results from the numerous transporters localized to the vacuole, 
which selectively transport individual sugars, amino acids, ions, and other 
species. Intriguingly, although many have been identified and cloned, some are 
simply hypothetical based on electrophysiology studies [111]. A better under- 
standing of this metabolic potential will be essential for future metabolic engi- 
neering efforts. 

One metabolite whose vacuolar accumulation has been exploited for 
metabolic engineering purposes is S-adenosyl methionine (SAM). SAM is the 
principal cellular currency for methyl transfer reactions and is a key cofactor 
in numerous enzymatic reactions. The majority of cellular SAM is stored in 
the vacuole [112]. Recently, Bayer and colleagues undertook a metagenomic 
approach to identifying enzymes involved in the biosynthesis of methyl hal- 
ides, industrially relevant commodity chemicals that can be upconverted to 
numerous other chemicals using zeolite catalysts [113]. During the initial 
work in E. coli, it was postulated that SAM concentrations were limiting production. 
Switching to yeast, Bayer et al. used a well-known targeting sequence, 
the N-terminus of carboxypeptidase Y, to deliver the Batis maritima methyl 
halide transferase to the vacuole and increase productivities for methyl iodide 
10-fold [114]. Further, taking advantage of the fact that SAM levels can be 
increased by altering media conditions, methyl iodide production was 
increased an additional fivefold by stimulating SAM production [115]. 


14.5 De novo Organelle Construction and Future Directions 


In other cases, it may be advantageous to start with cellular structures that can 
be repurposed to a larger extent and, perhaps, to create organelle function 
de novo. The simplest version of this idea is to completely hijack an existing 
organelle with less essential function. For example, peroxisomes are oxidative 
organelles that sequester toxic reactions such as methylotrophy and/or very-long-chain 
fatty acid catabolism, but are not required for cellular viability under most 
conditions [116]. As such, they are an intriguing target for engineering. Under- 
standing peroxisome biogenesis is still a work in progress, but proteomics exper- 
iments indicate the peroxisome contains about 10x fewer proteins than the 
mitochondrion, lending credence to the idea of simplicity [117]. Importantly, 
there are also well-defined targeting signals to both the matrix of the peroxisome 
and the membrane [118, 119]. The biosynthetic capability of the peroxisome 
has been exploited to improve production of biofuels such as fatty alcohols 
[120, 121] and alkanes [121]. 

A more ambitious area of research is to construct an entire organelle-like 
structure de novo. From a materials science perspective, organelles are formed 
when a set of molecular building blocks spontaneously self-organize, through 
molecular interactions, into complex patterns [24]. De novo design therefore 
requires identifying and engineering self-organizing building blocks. Additional 
properties that emerge from this process are organelle size/shape and copy 
number. These too must be accounted for. Recent work from Lim and col- 
leagues demonstrates how this may be possible [122]. By leveraging the various 
lipid binding and lipid synthesis/degradation domains from the phosphati- 
dylinositol signaling pathway, coupled with positive and negative feedback, 
Chau et al. were able to create pole-localized lipid microdomains. Given the 
large toolkit of phosphatidylinositol-binding domains, these microdomains 
could serve as the initial scaffold for generating more complex structures. In an 
orthogonal approach, Eriksson and colleagues demonstrated that overexpres- 
sion of an integral membrane lipid glycosyltransferase yields massive vesicle 
formation in E. coli [123]. Combining these approaches may enable the crea- 
tion of targetable distinct lipid-bound structures with controllable size and 
copy number, although this remains a tremendous challenge. However, such a 
synthetic organelle may also help to shed light on the natural organelle biogen- 
esis process [24]. 

Finally, it may be revealing to reflect upon even more complex engineer- 
ing challenges. Biology clearly takes advantage of compartmentalization far 
beyond a single genome. For example, the sea slug Elysia chlorotica carries out 
kleptoplasty (the theft of an organelle) and spends much of its life living not 
as an animal but as a plant after acquiring chloroplasts from its algal prey, Vaucheria litorea 
[124]. This intra-corpus symbiosis is maintained for the life of the slug by an as-yet 
unexplained mechanism [124]. It has been proposed this mechanism may 
involve the exchange of genetic material, but there are conflicting reports from 
differing experimental modalities as to whether algal DNA is actually incorpo- 
rated into Elysia’s genome [125, 126]. This inspires a remarkable research 
question: can kleptoplasty and endosymbiosis be engineered? It remains to be 
seen, but work from Silver and colleagues is an intriguing first step. Agapakis 
et al. found that cyanobacteria are surprisingly innocuous and do little to dis- 
turb viability when injected into zebrafish embryos [127]. Even more surpris- 
ing, cyanobacteria expressing invasin and listeriolysin can grow and divide, 
intracellularly, in macrophages while generating little to no immunogenic 
response. 

Alternatively, one could also imagine engineering extracellular symbioses. For 
example, one obvious use would be in biofuel production. There is considerable 
interest in constructing an organism for consolidated bioprocessing of plant- 
based biomass into fuel. This would entail engineering an organism for both fuel 
production and degradation of cellulose, the major component of plant-based biomass 
[128]. An alternative to this approach would be to develop a stable coculture 
of two or more organisms that accomplish the same thing. This would 
potentially be more modular as the chemical production pathway remains 
independent of sugar consumption. Interestingly, it has been found that stable 
communities composed of just a handful of bacterial species can indeed degrade 
cellulose [129]. Moreover, recent work using E. coli also suggests that stable 
mutualism can be predicted using metabolic flux modeling, which could help 
systematize future engineering efforts [130]. As a demonstration of this bottom- 
up strategy, Mee et al. designed and constructed a 14-member consortium of 
E. coli mutants that were able to survive up to 50 days, although it was ultimately 
dominated by only four strains [131]. In addition to single-species cocultures, it 
is possible to engineer stable mutualism between multiple species of microbes. 
For example, a synthetic fungal-bacterial consortium consisting of lignocellulose-degrading 
Trichoderma reesei and an isobutanol-producing E. coli strain 
can produce the branched alcohol from corn stover [132]. Most recently, Hays 
and colleagues paired a heterotroph such as E. coli, Bacillus subtilis, or S. cerevisiae 
with a mutant of the cyanobacterium Synechococcus elongatus PCC7942 that has 
increased sucrose exporting ability. The resulting synthetic consortia can survive 
for a long period of time (weeks to months). In addition, by changing the heterotroph, 
production of large quantities of the enzyme amylase (in the case of B. subtilis) 
and polyhydroxybutanoate (PHB) (E. coli) could be achieved [133]. Therefore, 
this symbiotic platform shows promise as a new modular strategy for capturing 
light energy in the form of bioproducts. It remains to be seen, however, how 
tractable these communities will be to engineering and to what extent they can be 
scaled up in an industrial setting. Nevertheless, this represents an interesting and 
untapped avenue for synthetic biology, perhaps informing both our understand- 
ing of endosymbiosis and the evolution of the cell as well as the interspecies 
interactions central to ecology. 
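
The logic of a sucrose-based consortium can be illustrated with a deliberately simple toy model. It is not the flux model of [130] or a description of the system in [133]: it assumes a phototroph that grows logistically and exports sucrose in proportion to its biomass, and a heterotroph whose growth follows Monod kinetics on that sucrose. All parameter values and units are hypothetical, and the interaction is one-way cross-feeding rather than true mutualism.

```python
def simulate_coculture(hours: float = 200.0, dt: float = 0.01,
                       mu_p: float = 0.05, cap_p: float = 1.0,
                       export: float = 0.2, mu_h: float = 0.3,
                       k_s: float = 0.5, yield_h: float = 0.4):
    """Forward-Euler simulation of a phototroph feeding sucrose to a heterotroph.

    All parameters are hypothetical and in arbitrary units:
    mu_p, cap_p  - phototroph growth rate and carrying capacity
    export       - sucrose exported per unit phototroph biomass per hour
    mu_h, k_s    - heterotroph maximum growth rate and Monod constant
    yield_h      - heterotroph biomass formed per unit sucrose consumed
    Returns final (phototroph, heterotroph, sucrose).
    """
    phototroph, heterotroph, sucrose = 0.05, 0.01, 0.0
    for _ in range(int(hours / dt)):
        growth_p = mu_p * phototroph * (1.0 - phototroph / cap_p)
        growth_h = mu_h * heterotroph * sucrose / (k_s + sucrose)
        d_sucrose = export * phototroph - growth_h / yield_h
        phototroph += growth_p * dt
        heterotroph += growth_h * dt
        sucrose = max(sucrose + d_sucrose * dt, 0.0)  # concentration cannot go negative
    return phototroph, heterotroph, sucrose


with_export = simulate_coculture()
without_export = simulate_coculture(export=0.0)
print(f"Heterotroph biomass with sucrose export:    {with_export[1]:.3f}")
print(f"Heterotroph biomass without sucrose export: {without_export[1]:.3f}")
```

In this sketch the heterotroph grows only when the phototroph exports sucrose, which is the dependency that makes the heterotroph partner swappable. Real consortia would also involve light limitation, oxygen effects, and competition that this model ignores.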
15.1 Introduction 


Cell-free protein synthesis (CFPS) systems have transformed our ability to 
understand, harness, and expand the capabilities of biological systems. In the 
groundbreaking experiments of Nirenberg and Matthaei in 1961, CFPS played 
an essential role in the discovery of the genetic code [1]. More recently, a techni- 
cal renaissance has revitalized CFPS systems to help meet increasing demands 
for simple and efficient protein synthesis. Moving forward, this renaissance is 
enabling new processes never seen in nature, such as noncanonical amino acid 
(ncAA) incorporation and man-made genetic circuits. 

The driving force behind this development has been the unprecedented free- 
dom of design to modify and control biological systems that is unattainable with 
in vivo approaches [2-6]. The ability to “open the hood” of the cell and treat biol- 
ogy as a set of chemical reactions leads to many advantages for using cell-free 
systems, highlighted in Figure 15.1. First, the open reaction environment allows 
the user to directly influence the biochemical systems of interest (e.g., protein 
synthesis, metabolism, etc.). As a result, new components (natural and nonnatu- 
ral) can be added or synthesized and can be maintained at precise concentra- 
tions, while the chemical environment is monitored and sampled. Second, 
since the reaction is not “living,” cellular objectives, such as growth, can be 
bypassed. As is desirable in chemical transformations, cell-free systems sepa- 
rate catalyst synthesis (cell growth) from catalyst utilization (protein produc- 
tion), circumventing a major challenge afflicting in vivo engineering efforts. This 
is featured in Figure 15.2. Without living cells, timelines for process and product 
development can be faster and scale-up can be easier [4]. Although the CFPS 
technology offers many exciting advantages, challenges remain that provide 
opportunity for improvement. For example, many emerging cell-free platforms 
are not yet commercially available, and thus their broad impact is limited. In 
addition, cell lysis procedures can be difficult to standardize, leading to different 
extract performance across labs. Further, complex posttranslational modifications 
(PTMs) (e.g., human glycosylation) are still limited or not yet shown. 
Finally, CFPS costs exceed those of in vivo methods for comparable organisms, which 
limits the scale accessible to most academic labs. Despite these challenges, the benefits of 
CFPS are inspiring new applications from the synthesis of pharmaceutical proteins 
to the understanding of synthetic gene circuits [7]. 

Figure 15.1 Advantages for cell-free biology. By bypassing cellular objectives and opening 
the reaction environment, cell-free protein synthesis allows for increased freedom of design as 
a result of the benefits highlighted here. Panel labels: CFPS gives an unprecedented freedom of 
design to modify and control biology; open reaction environment (control added components 
precisely; monitor and sample reaction environment); bypass cellular objectives (direct 
resources toward the exclusive production of one product). 

Figure 15.2 A new paradigm for cell-free biomanufacturing. Cell-free protein synthesis is able 
to separate catalyst synthesis (cell growth) from catalyst utilization (protein synthesis). This 
allows resources to be funneled toward the product of interest in ways not possible in vivo. 



This review highlights achievements of the existing systems for crude extract- 
based protein synthesis. We begin with an overview of the state-of-the-art systems 
from different organisms. Then, we discuss their capabilities for protein produc- 
tion, highlighting applications that greatly benefit from the open environment 
and lack of cell viability of CFPS. Finally, we describe benefits for high-through- 
put applications and offer some commentary about the future growth of the field. 


15.2 Background/Current Status 


Crude extract-based CFPS harnesses the cell’s native translational machinery to 
produce proteins in a process that, instead of occurring in a live cell, becomes more 
like a chemical reaction. The crude extract contains the translational machinery, 
which consists of ribosomes, aminoacyl-tRNA synthetases, initiation factors, elon- 
gation factors, chaperones, and so on. In addition to the translational machinery, 
other enzymes exist in the extract: some are beneficial (e.g., those for recycling 
nucleotides or energy metabolism) and some are detrimental (e.g., those using CFPS 
substrates nonproductively). In combined transcription-translation reactions, the 
crude extract is added to a solution containing buffer, amino acids, nucleotides, 
RNA polymerase, a secondary energy source (for regenerating adenosine triphos- 
phate (ATP)), salts, and other molecules for maintaining the environment (e.g., 
dithiothreitol for a reducing environment or spermidine and putrescine for mimick- 
ing the cytoplasm). Thus far, when compared with the use of purified enzyme trans- 
lation systems, such as the PURE system developed by Ueda and colleagues [8], as 
well as New England Biolabs [9, 10], crude cell lysates offer significantly lower sys- 
tem catalyst costs and much greater system capabilities (e.g., cofactor regeneration, 
proteins produced per ribosome, and long-lived biocatalytic activity) [2, 11]. The 
primary crude extract-based platforms and trends will be discussed. 


15.2.1 Platforms 


15.2.1.1 Prokaryotic Platforms 

E. coli Extract The well-established E. coli system provides high protein yields 
(up to 2.3 g l⁻¹) [12], as can be seen in Figure 15.3. The system has benefitted 
from its highly active metabolism, as well as the low cost and scalability of 
fermentable cells for extract preparation [11]. Notably, the dilute cell-free system 
has decreased translation elongation rates compared with in vivo (~10-fold 
lower), which improves the expression of mammalian proteins [2]. While perhaps 
unexpected, it should also be noted that this platform has even had success 
synthesizing some complex, and even disulfide-bonded, proteins [18, 19]. 
Additionally, well-developed genetic tools to make modifications to the source 
strain have been critical for developing synthetic genomes that upon cell lysis 
lead to improved protein production capabilities by removing negative effectors 
[20]. So far, a limitation of this system is its inability to produce PTMs, such as 
glycosylation. While PTMs could be enabled through the site-specific introduc- 
tion of ncAAs (see Section 15.4.1), for example, the inability to introduce PTMs 
has driven interest in developing eukaryotic platforms. 
Figure 15.3 Historical trends for different CFPS systems. Batch protein yields for the papers 
cited in this review are arranged by platform (a) and product type (b). In addition, cell-free 
protein synthesis has seen successes at a variety of volumes (c). ECE, E. coli extract; WGE, wheat 
germ extract; SCE, S. cerevisiae extract; ICE, insect cell extract; CHO, Chinese hamster ovary cell 
extract; LTE, L. tarentolae extract; STR, Streptomyces extract; BSE, B. subtilis extract; PDMS, 
polydimethylsiloxane; GFP, green fluorescent protein; iPSCs, induced pluripotent stem cells; 
rhGM-CSF, recombinant human granulocyte macrophage colony-stimulating factor. 


Other Prokaryotic Platforms More recently, alternative prokaryotic platforms 
have emerged. These platforms have been based on Bacillus subtilis [21] and 
several Streptomyces strains: Streptomyces coelicolor [22], Streptomyces lividans 
[22], and Streptomyces venezuelae [23]. However, the goals around these produc- 
tion systems are more specialized. The B. subtilis platform was intended for 
promoter prototyping and genetic circuits with the hope of translating this to 
in vivo protein expression for metabolic engineering. The Streptomyces
platforms, in contrast, were intended for expression of GC-rich genes, particularly
for expressing and studying natural product gene clusters.


15.2.1.2 Eukaryotic Platforms 

In contrast to the E. coli CFPS platforms, eukaryotic systems often produce complex
proteins with higher percentages of soluble yields. However, they are hampered
by comparatively low overall yields (e.g., an order of magnitude lower in standard
batch reactions for similar model proteins) and costly scale-up. While wheat
germ extract (WGE) has been the historical eukaryotic system of choice, several 
promising platforms for industrial use are also now emerging, which include 
extracts from Saccharomyces cerevisiae, insect cells, Chinese hamster ovary 
(CHO) cells, and Leishmania tarentolae, all of which are fermentable, providing 
possibilities for simple scale-up. 


Wheat Germ Extract The WGE system has been the most productive eukaryotic 
system thus far, producing over 13000 human proteins in one study [24]. The 
WGE platform is able to achieve several endogenous PTMs. However, there are 
aspects of the platform that are not amenable to large-scale protein production. 
For example, batch yields are typically low (~1-10 µg ml⁻¹ luciferase) [25], the
extract preparation is complex, and genetic modifications are challenging. That
said, the semicontinuous format has been shown to produce 9.7 g l⁻¹ green fluorescent
protein (GFP) [26]. This is remarkable, enabling the system to be a workhorse
for crystallography, NMR, and structural biology studies.


Yeast Extract Pioneered by the work of Iizuka and colleagues, several methods 
have been used for producing extracts from the yeast S. cerevisiae, which is 
another enticing option for a eukaryotic platform [27]. Like E. coli, it is easily 
grown in a fermenter. Also, the entire genome has been sequenced, and there is 
a wealth of biological tools, allowing for possible modifications to be made to 
improve protein production, which was important in the development of the 
E. coli platform. 

One method, developed by Wang and colleagues, starts by removing the 
outer membrane of the cell wall using lyticase, producing a protoplast. Then 
the protoplast is lysed with a 25-gauge needle. While this method is likely to 
maintain cellular compartments, the lyticase treatment is expensive on an 
industrial scale [28]. 

Other efforts have strived to be more viable as an economical and scalable 
system. These methods include the use of high-pressure homogenization for 
cell lysis, combined transcription/translation without need for mRNA capping 
[29], and a focus on technically simple extract preparation methods [25]. This 
new method was able to produce 7.69 ± 0.53 µg ml⁻¹ active luciferase, giving it a
fourfold improvement in relative product yield (µg $ reagent cost⁻¹) over the
protoplast method. At this time, it is uncertain whether this approach retains
cellular compartments after extract preparation, yet this is a very interesting
question. Additionally, using a semicontinuous reaction format to feed limiting
substrates (creatine phosphate, nucleotide triphosphates, and perhaps aspartic
acid) while removing toxic by-products (inorganic phosphate) led to product
yields of 17.0 ± 3.8 µg ml⁻¹ [30]. Other recent work with the system has explored
alternative energy sources [31], fermentation conditions [32], 5’ mRNA leader 
sequences [33], and gene knockouts [34]. Despite recent work in this system, 
yields need to be further improved. To do so, a better understanding of the 
metabolism of the lysate is necessary. Also, elimination of background, nonpro- 
ductive translation would allow for more efficient use of reactants toward the 
protein of interest. 


Insect Cell Extract Insect cell extract (ICE) systems are another promising 
platform for eukaryotic CFPS. This approach uses ovary cells of Spodoptera 
frugiperda, the fall armyworm, an industrial in vivo protein expression system 
[35]. Typical yields for the ICE system are ~45 µg ml⁻¹ luciferase [25]. Using
mechanical lysis and mild treatment of the extract, a process developed by
Kubick and colleagues is able to retain microsomal vesicles of the endoplasmic
reticulum (ER) within the extract [36]. These vesicles are important for trafficking
proteins into the ER for membrane insertion and PTMs. The Kubick lab
has exploited this by producing membrane proteins, which are able to co- 
translationally insert into the lipid-enclosed vesicles for stability, as well as 
glycosylated proteins, both of which will be described later [36, 37]. In addition 
to glycosylated and membrane proteins, the ICE system has also been demon- 
strated to incorporate ncAAs using a plasmid developed by the Schultz lab for 
use in S. cerevisiae [38]. 


Chinese Hamster Ovary Cell Extract CHO cells are widely used industrially for the 
expression of human recombinant proteins [38]. A benefit is their ability to
achieve mammalian PTMs, which remain a challenge for other platforms. Using the
same extract preparation method as for ICE, the Kubick lab has begun to develop a
highly efficient and high-yielding CHO cell extract. To achieve glycosylation and
produce membrane proteins, the reaction mixture can be enriched with microsomal
vesicles, yielding 30-50 µg ml⁻¹ of the protein of interest (e.g., luciferase) [38, 39].
This platform offers exciting opportunities for developing advanced process 
development pipelines for discovering and assaying protein therapeutics, which 
can be directly translated in vivo. 


Leishmania tarentolae Extract L. tarentolae, a lizard parasite, is a fermentable 
protozoan that was chosen for CFPS. The in vivo expression system is able to 
produce disulfide bonds and glycosylation, and the cells are easy to genetically 
modify [40, 41]. For extract preparation, a nitrogen cavitation method is used 
for lysis [42]. A key feature of the system is that all native mRNAs carry the same
“splice leader” sequence, allowing for inhibition of endogenous mRNA using
an oligonucleotide [40]. This prevents background translation, allowing
resources to be directed to synthesis of the protein of interest, producing
50 µg ml⁻¹ GFP. Using the L. tarentolae platform, Mureev and colleagues were
able to develop species-independent translational sequences (SITS), which
allowed for translation not only in the L. tarentolae platform but also in E. coli and
several eukaryotic cell-free platforms, presumably by a cap-independent pathway
[40]. It is expected that this system will aid in expressing proteins from
parasitic genomes to test their functions and annotate parasitic genomes, 
including that of L. tarentolae [43]. 


15.2.2 Trends


Several trends can be observed in the development of the aforementioned cell- 
free platforms. First, the recent development of several eukaryotic CFPS plat- 
forms highlights the enthusiasm and growth of the field. 

Second, yields continue to increase for CFPS, with a majority of products
expressed in the E. coli platform as seen in Figure 15.3a, which catalogs the
proteins expressed from manuscripts covered by this review. These improvements
have occurred as a result of improved soluble yields for the E. coli platform
and increased overall yields for the eukaryotic platforms. One method
that has been useful in the E. coli system is the use of fusion partners to aid
the soluble expression of aggregation-prone proteins [13, 44]. Also, moving from
glucose to starch as an inexpensive energy source allowed for better pH
maintenance, increasing soluble enhanced GFP from 10% to 25% in a study by Kim
and colleagues [45], as well as by Caschera and Noireaux [12]. The manuscript by
Caschera and Noireaux achieved the highest batch CFPS yield to date, 2.3 g l⁻¹
superfolder GFP. The increased yields and decreased cost have enabled the use of
freeze-dried lysates for solving cold chain problems with on-demand synthesis of
proteins for therapeutics [46, 47] and diagnostics [48, 49]. In contrast to prokaryotic
systems, eukaryotic systems generally produce a higher soluble portion but are
working toward increasing overall yields cost effectively. So far, this has typically
involved reducing background translation, although there are many exciting
opportunities for strain engineering. A target goal in the upcoming years is
to enable eukaryotic batch CFPS yields of greater than 0.5 mg ml⁻¹, which is
chosen because it is about an order of magnitude higher than current levels.
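
The yield figures in this section are reported in mixed units (g l⁻¹, mg ml⁻¹, and µg ml⁻¹), which makes the order-of-magnitude gap hard to see at a glance. The short Python sketch below normalizes a few of the batch yields quoted above to mg ml⁻¹ and compares each with the 0.5 mg ml⁻¹ eukaryotic target. The function and dictionary names are illustrative assumptions, and the selection of values is only a sample of those cited in the text.

```python
# Conversion factors to mg/ml (note that 1 g/l equals 1 mg/ml).
TO_MG_PER_ML = {"g/l": 1.0, "mg/ml": 1.0, "ug/ml": 1e-3}


def to_mg_per_ml(value, unit):
    """Convert a protein yield to mg/ml."""
    return value * TO_MG_PER_ML[unit]


# Batch yields quoted in this section: platform -> (value, unit).
batch_yields = {
    "E. coli (superfolder GFP)": (2.3, "g/l"),
    "Insect cell (luciferase)": (45.0, "ug/ml"),
    "CHO (luciferase, upper value)": (50.0, "ug/ml"),
    "L. tarentolae (GFP)": (50.0, "ug/ml"),
}

TARGET = 0.5  # mg/ml, the eukaryotic batch target stated in the text

for platform, (value, unit) in batch_yields.items():
    y = to_mg_per_ml(value, unit)
    # A ratio above 1 means the platform is still below the target.
    print(f"{platform}: {y:.3f} mg/ml, target/yield = {TARGET / y:.1f}")
```

A eukaryotic yield of 50 µg ml⁻¹ sits tenfold below the target, which is consistent with the order-of-magnitude rationale given above.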

Third, there is also an effort to reduce cost for CFPS. This has been done by 
moving toward lower cost energy sources, as well as streamlining the process. 
Instead of fueling the reactions with substrates containing high-energy phos- 
phate bond donors, such as creatine phosphate or phosphoenolpyruvate, E. coli 
reactions have been shown to use glucose and starch as well as nucleoside 
monophosphates in lieu of triphosphates, greatly reducing cost [12, 45, 50]. So 
far, eukaryotic systems have not been able to activate cost-effective energy 
metabolism from non-phosphorylated energy substrates, which will be critical 
for any industrial-scale applications. Toward more robust and consistent extract 
preparation methods, extract protocols have been streamlined [51-53]. Another 
method has combined the small molecules in the reaction into a premix, used T7 
polymerase from a crude lysate without purification, and reduced extract prepa- 
ration by two steps [54]. 

Fourth, over the last decade, efforts to synthesize complex proteins have inten- 
sified. Figure 15.3b, which organizes the values from Figure 15.3a by product, 
highlights the shift from production of standard reporter proteins, such as lucif- 
erase and GFP, toward products containing ncAAs, glycosylation, and disulfide 
bonds as well as membrane proteins. We expect this trend to continue, particu- 
larly given the freedom of design in adjusting cell-free components by the direct 
addition of new components. 

Finally, we note that cell-free platforms have been able to span 17 orders of 
magnitude in terms of reaction volumes (Figure 15.3c). Notably, the E. coli system
has been shown to scale linearly from 250 µl reactions to 100 l, an expansion
factor of 10⁶, producing 700 mg l⁻¹ to enable manufacturing-scale
synthesis of soluble human granulocyte macrophage colony-stimulating factor
(GM-CSF) with two disulfide bonds [4]. In the other direction, there has
recently been a move toward smaller, microbe-mimicking reaction sizes 
[14, 15]. These efforts are useful for high-throughput applications and bread- 
boarding of genetic circuits, both of which will be described later. To learn 
more about economical scale-up of cell-free systems, see reviews by Swartz [2] 
and by Carlson et al. [6]. 
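
The volume ranges in this paragraph can be checked with simple arithmetic. The sketch below computes the fold change between two reaction volumes and the corresponding number of orders of magnitude. The helper name and the choice of 1 fl as the smallest reaction volume are assumptions made for illustration; the 250 µl and 100 l values come from the text.

```python
import math

# Volume units expressed in liters.
LITER = {"fl": 1e-15, "ul": 1e-6, "l": 1.0}


def expansion_factor(small, small_unit, large, large_unit):
    """Return the fold change between two reaction volumes."""
    return (large * LITER[large_unit]) / (small * LITER[small_unit])


# Scale-up of the E. coli system described above.
factor = expansion_factor(250, "ul", 100, "l")
print(f"250 ul -> 100 l: {factor:.1e}-fold "
      f"({math.log10(factor):.1f} orders of magnitude)")

# Full span of reported volumes, assuming 1 fl as the smallest reaction.
span = expansion_factor(1, "fl", 100, "l")
print(f"1 fl -> 100 l: {math.log10(span):.0f} orders of magnitude")
```

The 250 µl to 100 l scale-up works out to a 4 × 10⁵-fold increase, that is, close to six orders of magnitude, while a femtoliter-to-100 l range spans 17 orders of magnitude.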

The improvements in yields and cost, as well as scalability, give CFPS great 
utility. Examples of its applications are highlighted in the next section. 
15.3 Products


CFPS offers the opportunity not only to produce proteins that standard methods
are able to produce but also to solve expression problems with proteins that
are notoriously difficult to synthesize in vivo. Examples of such products are
described in the following sections.


15.3.1 Noncanonical Amino Acids 


Site-specific incorporation of ncAAs into proteins opens many doors for the 
production of proteins with new structures, functions, and properties. For such 
applications, cell-free systems have an advantage over in vivo systems because of 
their open environment and lack of need for cell viability. Indeed, recent efforts 
by Albayrak and Swartz [55], as well as Jewett and colleagues (unpublished), have 
shown the ability to synthesize greater yields of protein in batch CFPS reactions 
as compared with the in vivo approach. The benefit appears to come from the 
fact that the orthogonal translation systems can be toxic to the cell. Moreover, 
the ncAA can be added directly to the reaction mixture, instead of relying on 
cellular uptake, and ncAAs can be used that would otherwise be toxic to cells. 
This technology has been used in cell-free systems to polymerize proteins [16], 
conjugate human erythropoietin to a fluorophore in ICE [38], and modify the 
oncoprotein c-Ha-Ras in the WGE [56], along with many others. 

The most common method for ncAA incorporation is amber suppression,
which inserts the ncAA at the location of the amber stop codon (UAG) in the
reading frame of the gene of interest. With the addition of an orthogonal tRNA,
orthogonal aminoacyl-tRNA synthetase, and ncAA, a UAG codon placed at a
specific location in the gene allows for the template-encoded addition of the
ncAA, as seen in Figure 15.4. This method has been extended to insert a second
amino acid using the ochre stop codon (UAA) in combination with the amber
codon for the incorporation of two unique ncAAs in a CFPS reaction [57]. Recent
advancements from Albayrak and colleagues allow for the synthesis of the
orthogonal tRNA (o-tRNA) during the protein synthesis reaction, improving
scale-up possibilities [55]. One problem that plagues amber suppression both 
in vivo and in vitro is competition between the o-tRNA and release factor 1 (RF1). 
One solution to this problem is to use a different system for incorporation, using 
a four-nucleotide codon [58]. Further, cell-free systems open the possibility of 
expanding the genetic code by introducing additional Watson-Crick base pairs
[59] and hijacking sense codons [60]. Since cell viability is no longer an issue, 
other options remove the problem with RF1 by either adding an aptamer to inhibit 
it [58] or tagging RF1 and removing it prior to protein synthesis [58, 61]. Looking 
forward, the development of an RF1 deletion strain as a chassis for CFPS will open 
new avenues for using cell-free synthetic biology for synthetic chemistry [62]. 
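
The logic of amber suppression can be illustrated with a toy translation routine. In the Python sketch below, a UAG codon terminates translation by default (standing in for RF1) but is decoded as an ncAA when suppression is switched on (standing in for the orthogonal tRNA and synthetase). This is an idealized sketch: the codon table is deliberately partial, the example mRNA is made up, the symbol "X" for the ncAA is an assumption, and the competition between o-tRNA and RF1 is not modeled.

```python
# Partial codon table for this example only (not the full genetic code).
CODON_TABLE = {
    "AUG": "M", "GGC": "G", "AAA": "K", "GAA": "E",
    "UAG": "*", "UAA": "*", "UGA": "*",  # amber, ochre, opal stop codons
}


def translate(mrna, amber_suppression=False, ncaa_symbol="X"):
    """Translate an mRNA codon by codon until a stop codon is reached.

    With amber_suppression=True, UAG is decoded as the ncAA (idealized:
    the o-tRNA always outcompetes RF1). Otherwise UAG terminates.
    """
    protein = []
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = mrna[i:i + 3]
        if codon == "UAG" and amber_suppression:
            protein.append(ncaa_symbol)
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical template with an in-frame amber codon at position 3.
mrna = "AUGGGCUAGAAAGAAUAA"
print(translate(mrna))                          # truncated product
print(translate(mrna, amber_suppression=True))  # full-length product with ncAA
```

Without suppression the product is truncated at the amber codon; with suppression the ncAA is inserted at that position and translation continues to the ochre stop codon.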


Figure 15.4 The production of proteins containing ncAAs is a frontier of CFPS. Amber
suppression, shown here, is the most common method for ncAA incorporation in CFPS
platforms but is hampered by competition between the amber-suppressing tRNA and release
factor 1 (RF1). Several methods have been developed to prevent this competition. Also, new
strains lacking RF1 should address this issue.


15.3.2 Glycosylation


For any protein synthesis technology, glycosylation cannot be ignored. It is estimated
that over 50% of human proteins are glycosylated [63]. For pure chemical
synthesis, the stereochemistry of sugars is challenging to make consistently [64],
and for in vivo protein production, one must use mammalian cells, which are 
significantly more challenging and more expensive to culture than E. coli. This 
motivates a need for a fast, accurate method for producing glycoproteins using 
CFPS systems. 

Initial work on the production of glycoproteins in CFPS was reported in 1978,
when canine pancreas microsomes containing glycosylation machinery were added to a
WGE reaction [65]. More recently, Guarino and colleagues chose to use the
E. coli cell-free platform for synthesizing glycoproteins by adding the 
Campylobacter jejuni glycosylation machinery [66]. Since E. coli has no native 
glycosylation machinery, there was no mixture of glycosylation products. 
Also, due to the open environment of the system, the substrates could be 
directly added to the reaction to achieve N-linked glycosylation. Alternatively, 
the ICE system is able to maintain microsomes due to the method of lysate pro- 
duction [36]. These microsomes allow for N-linked glycosylation, as well as aid 
in the production of membrane proteins, described later. The CHO cell system 
had similar results to the ICE system [39]. While efforts to make glycoproteins 
are underway, there are still two drawbacks: no system is yet able to produce 
human glycosylation patterns and efforts to achieve O-linked glycosylation are 
limited. Addressing these limitations will open new avenues for studying and 
engineering glycosylation. For example, our ability to study and control glyco- 
sylation outside the restrictive confines of a cell will help answer fundamental 
questions such as how glycan attachment affects protein folding and stability. 
Answers to these questions could lead to general rules for predicting the struc- 
tural consequences of site-specific protein glycosylation and, in turn, rules for 
designing modified proteins with advantageous properties. 
15.3.3 Antibodies


Antibodies and their variants, typically produced by in vivo recombinant protein
methods, have recently gained much attention largely due to their high specificity
[67]. However, in vivo methods, particularly in prokaryotic cells, can be a
challenge when producing high concentrations of antibodies due to their aggre- 
gation, leading to insolubility [68]. Yin and colleagues faced this challenge when 
producing full-length antibodies in the E. coli extract (ECE) platform. Notably, 
they observed that the heavy chain (HC) was more prone to aggregation and 
needed the light chain (LC) for soluble co-expression [17]. This was an easy 
problem to solve with the open reaction environment of CFPS. They first 
expressed the LC plasmid for 1h and then added the plasmid for the HC to start 
its translation. This strategy produced 300 mg l⁻¹ aglycosylated trastuzumab in
reactions ranging from 60 µl to 4 l at greater than 95% solubility. Martin et al.
were able to then translate this lesson in plasmid timing, as well as oxidizing
conditions and chaperone addition, to the CHO CFPS platform for the expression
of >100 mg l⁻¹ active, intact mAb [69]. In addition to the full-length antibody,
antigen-binding fragments [19] and single-chain variable fragments [18, 70, 71] 
have been produced in a variety of cell-free systems. In fact, notable work by 
Kanter and colleagues created fusion proteins of a tumor-derived scFv with 
GM-CSF (a cytokine) or nine amino acids from interleukin-1β, which improved
potency of the scFv by increasing immune system stimulation for cancer therapy 
[18]. These advances demonstrate the merits of CFPS systems as a potentially 
powerful antibody production technology. However, cell-free antibody production
still suffers from a lack of human glycosylation, which could be achievable
in the future through the aforementioned glycosylation methods or through ncAA
incorporation followed by coupling of the oligosaccharides.


15.3.4 Membrane Proteins 


Membrane proteins are an excellent application for CFPS. Chemical synthesis of 
membrane proteins can take 1-2 weeks [72], while in vivo methods struggle with 
obtaining high yields, minimizing degradation, and maintaining cell viability 
[73]. Cell-free systems speed up the process to a matter of hours with decreased 
proteolysis and no need to maintain living cells. Indeed, CFPS of membrane pro- 
teins has received considerable attention in recent years. For example, it has 
aided in the determination of protein structures, via NMR and crystallography, 
which were previously impossible, such as ATP synthase and G protein-coupled 
receptors (GPCRs) [74-76]. The challenge is finding a suitable substitute for the
lipid bilayer. As seen in Figure 15.5, these substitutions include the use of deter- 
gents (in micelles or bicelles) [74, 77, 78], liposomes [74, 75, 77, 78], nanodiscs 
[76, 79, 80], tethered bilayer lipid membranes (tBLMs) [81, 82], and microsomal 
vesicles [36, 37]. 

Figure 15.5 CFPS is a useful approach for the production of membrane proteins. Several
methods have been implemented to mimic the cell membrane in cell-free protein synthesis:
(a) lipid bilayer, (b) liposome, (c) micelle, (d) bicelle, (e) nanodisc, and (f) tethered bilayer lipid
membrane.

One option is to produce the protein, precipitate it, and then solubilize it in
detergents or liposomes; however, this does not allow for ideal structure and
function studies because it is not an accurate membrane mimic [83]. Further,
some detergents cannot be added to the reaction in high concentrations
because they inhibit transcription and translation [75, 83]. These methods also
do not take advantage of the open reaction environment of cell-free systems. 
Unlike cells, where it is impossible to add chemicals directly to the protein as it 
is synthesized, CFPS allows for co-translational insertion into liposomes, nanodiscs,
tBLMs, or microsomes, all of which can be added exogenously to the reaction. 
Nanodiscs, consisting of a lipid bilayer surrounded by a protein scaffold, were 
found to be a better mimic of the lipid bilayer and thus obtained higher yields 
of soluble membrane proteins when compared with detergents and liposomes 
[80]. In fact, a functional GPCR, a highly studied but difficult to produce pro- 
tein, was first produced in soluble form using nanodiscs in a cell-free reaction 
[76]. Another useful aspect of nanodiscs is the ability to co-express the nano- 
disc protein scaffold and membrane protein in the cell-free reaction, reducing 
the number of production and purification steps necessary [79]. For deeper 
structural and functional studies, the tBLMs use self-assembly to attach a 
membrane to a gold surface. The protein can then be co-translationally inserted 
into the membrane and immediately studied using surface plasmon-enhanced
fluorescence spectroscopy (SPFS), imaging surface plasmon resonance
(iSPR), and fluorescence polarization (FP) [84]. Similar to the tBLMs, CFPS has
also been used in conjunction with a phospholipid bilayer supported on quartz
crystal microbalances for direct characterization of membrane proteins as they
are expressed [85]. CFPS of membrane proteins promises to help unravel the
function and structure of many potential drug targets. 


15.4 High-Throughput Applications 


Processes that take days or weeks to design, prepare, and execute in vivo can 
often be done more rapidly in a cell-free system. The use of polymerase chain 
reaction (PCR) templates significantly speeds up the process, since no time- 
consuming cloning steps are needed. Also, since the cell-free system is simpler 
and easier to control than cells, it allows for direct manipulation of reaction envi- 
ronments, as well as optimization of the reaction conditions. These characteris- 
tics are highlighted in the following examples of high-throughput protein 
synthesis for both production and screening as well as genetic circuit designing 
and testing. 


15.4.1 Protein Production and Screening 


While chemistry has been able to produce small molecule libraries for easy 
screening, the ability to produce proteins for similar procedures has been chal- 
lenging. However, with cell-free systems, there is no need to transform cells with 
plasmids, produce the protein, and then lyse the cells. Instead, a PCR template or 
plasmid can be added to a small reaction mixture in a plate, the protein can be 
produced, and then the various proteins on the plate can be screened in situ, all 
in a matter of hours [86]. For example, Karim and Jewett expressed several 
enzymes in a CFPS reaction for prototyping metabolic pathways in E. coli lysates 
in order to quickly arrive at the best combination of enzymes for the production
of butanol [87]. Since CFPS reactions are at a small scale, microfluidics can
also be used to supply small molecules [88], and when the number of reactions
becomes too large, liquid handling can easily be automated [89]. One of the most
impressive examples of using CFPS for high-throughput protein production is 
the human protein factory [24]. In this study, the authors expressed 13,364 
human proteins using the WGE platform and then compiled the protein expres- 
sion information in an online database [24, 90]. 

In addition to producing proteins from standard plasmids and PCR products,
it is possible to produce protein arrays from DNA arrays. Since DNA
arrays are much easier to make and more stable than protein arrays, He and colleagues
developed a method to “stamp” the proteins on a new array by putting a DNA
array plate face down on a second plate with the CFPS reaction mixture 
between the plates [91]. After the proteins were produced, they associated 
with the surface of the new plate. Stoevesandt and colleagues demonstrated 
the utility of this method when they produced an array of 116 distinct pro- 
teins [92]. In addition to its ease, it was found that one DNA array was able to 
produce at least 20 new protein arrays [91]. Protein arrays are beginning to
provide an improved toolbox and a faster process for probing different aspects of
protein function, and their role in enzyme screening will continue to grow in
the upcoming years.


15.4.2 Genetic Circuit Optimization 


There is currently a need for “breadboarding” of in vivo biological circuits in 
order to accelerate the design-build-test loops associated with synthetic biology
studies. Biological circuits rely on regulation and control of protein products
and can take a long time to assemble in vivo, so a system is needed that will func- 
tion similarly to the cell with faster results and greater flexibility for manipula- 
tion: a great application for CFPS platforms. The combinatorial nature of testing 
the variations of the circuits also lends itself to high-throughput methods. Also, 
the open environment of the CFPS reaction allows for more control for these 
studies, since the initial concentrations of mRNA and protein as well as the exact 
reaction size can be directly manipulated. Methods have been developed to 
characterize parts (e.g., promoters, ribosome binding sites, terminators, and 
spacing), as well as multienzyme systems, such that they function predictably 
both in vitro and in vivo [21, 23, 93-96]. In one such example, Chappell and colleagues
recognized that ribosome binding site measurements made with linear PCR products
in vitro correlated directly with in vivo measurements, but promoter measurements
did not [94]. Thus, they used a USER-ligase method to circularize PCR products,
the results of which were able to correlate between both platforms while keeping
production time short by avoiding the need for a plasmid typically obtained by
cell growth. In addition to characterization, cell-free systems have been used to
test new options for circuit proteins, such as endogenous sigma factors, to
supplement the common LacI
and TetR proteins [7]. Aiding in the high-throughput area, reactions at the nano- 
liter, picoliter, and femtoliter scales are being explored as a method to better 
approximate the volume of a cell. This involves using microfluidics to feed small 
molecules to the reaction [15, 97], which diffuse well due to the small volume, as 
well as studying noise in gene expression [14], which could aid in the future 
design of gene circuits. To learn more about in vitro genetic circuits, see a review 
by Hockenberry [98]. 
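
The combinatorial character of circuit testing, and why it pairs naturally with plate-based cell-free reactions, can be sketched in a few lines of Python. The example below enumerates every combination of a small set of parts and assigns each variant to a well of a 96-well plate. All part names are hypothetical placeholders rather than real characterized parts, and the plate layout is only one possible convention.

```python
from itertools import product

# Hypothetical part libraries; the names are placeholders, not real parts.
promoters = ["P1", "P2", "P3"]
rbs_sites = ["RBS_a", "RBS_b", "RBS_c", "RBS_d"]
terminators = ["T1", "T2"]

# Every promoter/RBS/terminator combination is one circuit variant.
variants = list(product(promoters, rbs_sites, terminators))
print(len(variants))  # 3 * 4 * 2 combinations


def well_name(index, columns=12):
    """Map a 0-based reaction index to a 96-well plate label (A1 to H12)."""
    row, col = divmod(index, columns)
    return f"{'ABCDEFGH'[row]}{col + 1}"


# One cell-free reaction per well, filled row by row.
layout = {well_name(i): variant for i, variant in enumerate(variants)}
print(layout["A1"], layout["B12"])
```

Even this small library fills a quarter of a plate, and adding one more part category multiplies the count again, which is why automated liquid handling and small reaction volumes become attractive.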


15.5 Future of the Field 


CFPS is emerging as a disruptive technology. It has promising applications for 
rapid, high-throughput screening and production of enzymes and personalized 
medicines, membrane proteins, and proteins containing ncAAs. Other applica- 
tions include efforts to construct fully synthetic ribosomes in vitro [99] as well as 
artificial cells [7, 100]. Equally important, CFPS is expected to help address the 
increasing discrepancy between genome sequence data and their translation 
products. The Sargasso Sea expedition alone, for example, generated 1.2 million 
new genes, many with unknown function [101]. This concept has already been 
proven by the expression of the entire T7 bacteriophage genome [102] as well as 
nanoassemblies of T4 bacteriophage structural proteins [103]. Unfortunately, 
current cell-based technologies for heterologous protein expression have been 
unable to meet the rapidly expanding need for affordable, simple, and efficient 
protein production because they (i) can be slow (requiring time-consuming 
cloning strategies), (ii) can require laborious protein purification procedures, 
and (iii) can lack robustness and predictability for several reasons: the complexity
of the cell, the host-dependent gene expression and protein folding/function, the
necessity of product export from the cell membrane for improved production, 
and the toxicity of high levels of expressed proteins to the host. CFPS can address 
many of these limitations to help complement existing technologies, but there 
are remaining immediate challenges. For example, the field is limited in its ability
to produce posttranslationally modified proteins at high titers, particularly
those with human patterns. Moreover, we still do not have the protein equivalent
of PCR. Further, inefficiencies in site-specific incorporation of ncAAs limit
innovation. By addressing these challenges, we anticipate that cell-free systems
will continue to penetrate and be recognized for value by industry. Given the 
capability to modify and control cell-free systems, CFPS holds promise to be a 
powerful tool for systems biology, for synthetic biology, and as a protein produc- 
tion technology in years to come. 


Definitions 


Cell-free protein synthesis is the process of translating proteins in lysates 

In vitro describes processes performed outside of their biological context, e.g., protein
synthesis occurring outside the cell

Noncanonical amino acid is any amino acid outside the 20 canonical amino 
acids 

Glycosylation is the addition of sugar moieties to proteins 

Antibody is a protein of the immune system that recognizes and neutralizes
pathogens

Membrane protein is a protein that is associated with or integrated into a
cellular membrane

High-throughput is the capability of being performed many times in parallel 

Protein screening is the process of testing one or more proteins or protein variants 
in one or more contexts to determine properties of the protein(s) or optimize 

Genetic circuit is the engineered use of DNA sequences to control biological 
reactions and programs 
16.1 Introduction


Pathways, which are cascades of biochemical reactions catalyzed by enzymes, 
maintain the vitality of all living organisms. These biochemical routes have been 
exploited to produce numerous commodities since early civilization, such as 
beer, wine, and cheese. With the advance of biotechnology, various genetic tools 
have become available for construction and manipulation of pathways to effi- 
ciently convert renewable feedstock to value-added compounds such as specialty 
chemicals, pharmaceuticals, and biofuels [1]. Microbial production of these 
compounds is usually enabled by overexpressing endogenous or heterologous 
enzymes of the corresponding pathways. However, overexpression of pathway 
enzymes alone can be insufficient for optimal metabolite production due to an 
imbalanced flux through the pathway [1, 2]. A typical symptom of flux imbalance 
is the accumulation of unwanted and even toxic intermediates [3, 4], which can 
be detrimental to the productivity of desired compounds. There is seldom
a straightforward strategy to resolve this accumulation of non-product
intermediates because the enzymes within a pathway are not independent;
instead, they are intertwined and cross-regulated, both with one another and
with the cell’s intricate metabolic networks. Due to this complexity, rationally
engineering a pathway to improve its efficiency is a significant challenge. For
this reason, random approaches can be preferred over rational design in pathway
engineering [5].
Random engineering approaches to optimize pathways generally screen through 
large and/or combinatorial pathway libraries. Pathway libraries have been con- 
structed for diverse gene expression based on promoters of different strengths 
[6], varied intergenic regions affecting mRNA stability [4], or engineered riboso- 
mal binding sites (RBSs) of diversified translational initiation rates [7]. 

In previous studies [4, 6-8], the pathway libraries were assembled by restric- 
tion digestion/ligation or overlap extension polymerase chain reaction (PCR). 
These traditional assembly methods were limited in complexity of design, being
forced to rely on the multiple-cloning site (MCS) for pathway assembly, and had
low assembly efficiency. In recent years, a number of new DNA assembly meth- 
ods have been developed, such as DNA assembler [9], sequence and ligation- 
independent cloning (SLIC) [10], Gibson assembly [11], circular polymerase 
extension cloning (CPEC) [12], Golden Gate cloning [13], and BioBrick stand- 
ards [14]. These advanced DNA assembly methods have ameliorated the design 
constraints on heterologous pathway construction and simplified the assembly 
of multi-gene metabolic pathways. The improved efficiency of these methods 
allows for larger and unbiased library creation, while the modularity of the 
methods greatly facilitates the generation of complex combinatorial libraries 
(Figure 16.1). This chapter briefly describes the advanced assembly methods
that can be applied to combinatorial pathway libraries and then discusses some
of the most recent work in pathway library generation using these methods.

Figure 16.1 Overview of the combinatorial library approach for pathway improvement. When
improving a multi-gene pathway, variations of the pathway components including promoters,
RBSs, coding DNA sequences (CDSs), or intergenic regions are generated by
mutagenesis, homolog cloning, or in silico design (promoters and CDSs are used as examples
in the figure). The diversified components are then assembled by various DNA assembly
techniques to form a library of combinations. Cells hosting this pathway library are then
screened for the optimum of the desired phenotype. Labels “p1-3” stand for promoters.
Labels “t1-3” stand for terminators.

The following methods have the potential to be used for pathway library creation
(Table 16.1). The advanced assembly methods exploit diverse strategies for 
pathway construction such as homologous recombination, DNA polymerase 
extension, and advanced applications of restriction digestion/ligation. 

The following assembly strategies are based on homologous recombination 
and DNA repair mechanisms: DNA assembler, Gibson assembly, and SLIC. In 
the DNA assembler strategy, the endogenous in vivo homologous recombina- 
tion mechanism in yeast is used to create large pathways in a simple, one-step 
manner [9, 15, 16]. The DNA fragments to be assembled are PCR amplified with
oligos that carry an 80-bp region homologous to the neighboring 5’ and 3’
DNA sequences within the pathway. The linear DNA fragments are
co-transformed with the linear plasmid backbone into Saccharomyces cerevisiae,
and the homologous regions are recognized by the endogenous homologous 
recombination machinery and “repaired” into a single DNA molecule. 

Mimicking the in vivo homologous recombination mechanisms, in vitro 
assembly has been accomplished by SLIC and Gibson assembly. SLIC is a two- 
step DNA assembly method [10], which utilizes a 30-bp homology region. The 
linearized host vector and the insert DNA fragment are separately treated with 
T4 DNA polymerase in the absence of deoxynucleotide triphosphates (dNTPs), 
which chews back the 3’ ends. This generates 5’ overhangs on the vector and
insert that are homologous to each other. The second step involves the addition
of RecA and adenosine triphosphate (ATP), which recombine the DNA fragments
into a single plasmid; any remaining nicks are repaired after transformation.
The Gibson assembly method [11] uses an exonuclease to chew back the 5’ ends,
generating complementary single-stranded overhangs, and a ligase included in
the reaction mix to seal the resulting nicks. The DNA fragments are PCR
amplified with 15-30 bp of DNA homologous to the adjacent 5’ and 3’ sequences.
In a single reaction, both vector and insert are subjected to T5 exonuclease,
which chews back the 5’ ends of the DNA fragments; the polymerase and ligase
then join the annealed homologous ends of the fragments into a single circular
DNA molecule.
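
The homology-based strategies above share one design rule: each fragment must end with a sequence that matches the start of its downstream neighbor. The following is a minimal Python sketch of that rule. The function names, the 20-bp default overlap, and the random 60-bp test sequences are illustrative assumptions, and the code models only the sequence bookkeeping, not the enzymology of any particular method.

```python
import random


def add_homology_arms(parts, overlap=20):
    """Extend each part with the first `overlap` bases of its downstream
    neighbor so that adjacent fragments share a homologous end."""
    extended = []
    for i, part in enumerate(parts):
        downstream = parts[(i + 1) % len(parts)]  # wrap around: circular product
        extended.append(part + downstream[:overlap])
    return extended


def join_by_overlap(fragments, overlap=20):
    """Join fragments in order, checking the shared overlap at each junction."""
    product = fragments[0]
    for fragment in fragments[1:]:
        if product[-overlap:] != fragment[:overlap]:
            raise ValueError("adjacent fragments do not share the overlap")
        product += fragment[overlap:]
    # Close the circle: the tail must match the start of the first fragment.
    if product[-overlap:] != fragments[0][:overlap]:
        raise ValueError("ends do not close into a circle")
    return product[:-overlap]


# Hypothetical 60-bp parts standing in for a vector and two inserts.
rng = random.Random(1)
parts = ["".join(rng.choice("ACGT") for _ in range(60)) for _ in range(3)]
fragments = add_homology_arms(parts)
plasmid = join_by_overlap(fragments)
```

In this sketch the overlap length is a single parameter, so the 15-30 bp used for Gibson assembly or the longer regions used for in vivo recombination can be explored by changing one value.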


Table 16.1 Summary of different advanced DNA methods that could be used for combinatorial
library generation.

Method              Type of reaction
SLIC                Exonuclease-based overhang generation and in vivo ligation
BioBrick standards  Step-wise modular restriction digestion and in vitro ligation
Golden Gate         Type IIS restriction enzyme digestion and in vitro ligation
DNA assembler       In vivo homologous recombination
Gibson assembly     Exonuclease-based overhang generation and in vitro ligation
CPEC                Overlap extension PCR

Homologous recombination is successful in DNA assembly, but basic polymerase
extension mechanisms have also been shown to be successful in the CPEC
method for assembling DNA fragments into a plasmid [12]. The insert and vector
are fused in an overlap extension PCR and circularize via the extended
overlapping strands, leaving only a nick in each strand. Escherichia coli then
repairs the nicks in vivo after transformation.

Another family of advanced DNA assembly techniques has been developed
around type IIS endonucleases such as BsaI, which cleave the DNA outside of
their recognition sites, resulting in 5’ or 3’ DNA overhangs of nearly any
user-defined nucleotide sequence [13, 17]. This strategy is more advanced than
the traditional restriction digestion/ligation method because it allows more
flexibility in insertion location than cloning into the MCS on a plasmid. The
Golden Gate assembly method uses type IIS endonucleases in a one-step reaction
that combines restriction digestion and ligation. This method has a high
fragment assembly efficiency and has proven effective in creating gene
libraries [17]. A continuing area of research with this technique is a more
modular approach to pathway and pathway library construction [18, 19].
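
Because the overhangs left by a type IIS enzyme are user defined, the order of fragments in a one-pot reaction is set entirely by which overhangs can pair. The sketch below checks a proposed set of overhangs against a few commonly used design rules. These rules (four-nucleotide overhangs as produced by BsaI, no palindromes, no duplicates, no reverse-complement pairs) and the function names are assumptions for illustration; they are not taken from the studies cited in this chapter.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def check_overhangs(overhangs):
    """Return a list of problems; an empty list means the set passes."""
    problems = []
    seen = set()
    for overhang in overhangs:
        rc = reverse_complement(overhang)
        if len(overhang) != 4:
            problems.append(f"{overhang}: not 4 nt long")
        if overhang == rc:
            problems.append(f"{overhang}: palindromic, can ligate to itself")
        if overhang in seen:
            problems.append(f"{overhang}: used at more than one junction")
        elif rc in seen:
            problems.append(f"{overhang}: reverse complement of another overhang")
        seen.add(overhang)
    return problems


# Hypothetical junction overhangs for a three-fragment assembly.
issues = check_overhangs(["AATG", "GCTT", "CGAA"])
```

An empty result means every junction is unique under these assumed rules, so each fragment has only one compatible neighbor on each side.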

The need for modularity in gene and pathway cloning is becoming more sig- 
nificant with recent focuses on high-throughput DNA assembly and automation. 
One of the most established strategies for assembly standardization is the 
BioBrick system [14, 20-24]. The BioBrick and BglBrick standards (for parts such
as vectors, promoters, and RBSs) rely on isocaudomer pairs of restriction
enzymes to generate compatible cohesive ends that, upon ligation, form a scar
sequence that cannot be cleaved by either of the original restriction enzymes.
DNA fragments flanked with these recognition sequences can be used for modular
assembly of a pathway by iterative digestions and ligations.

Consideration of which assembly strategy to use for the generation of pathway 
libraries will greatly depend on the chassis, number of DNA fragments, and 
required assembly efficiency. In vivo homologous recombination is especially 
useful if the pathway is being expressed in S. cerevisiae. However, Gibson
assembly and BioBrick standards are very useful when working in E. coli. A large
number of DNA fragments in the library can greatly decrease the assembly
efficiency, which should be considered if a complex pathway is being
investigated. If assembly efficiency is limiting, a strategy that allows for
longer homology or linker regions can be applied. Though no studies have linked
library size to assembly
strategies, some of the previous strategies might limit the library size, which 
could reduce the potential search space. Biases in assembly toward a certain gene 
or promoter can also reduce the potential search space. It is important to ensure 
that the library is diverse and that randomly picked clones represent all potential genotypes of
the library. One-pot assembly is also an important consideration, as iterative 
assemblies can be time consuming and can also reduce the potential library size. 


16.3 Generation of Pathway Libraries


Combinatorial pathway library screening strategies, as compared with tradi- 
tional pathway engineering strategies, can be more efficient in the identification 
of an optimized pathway. Traditional strategies optimize individual components 
of the pathway one at a time to increase flux to the desired product [25-27], but
pathway library screening strategies can tune multiple components of the 
pathway simultaneously. By varying multiple constituents concurrently, the like- 
lihood of obtaining an optimized flux via balanced gene expression and protein 
activity within the pathway is increased. A more comprehensive exploration of 
the potential diversity of a target pathway can be achieved, which could identify 
unexpected synergistic effects [28, 29]. Many pathway optimization strategies
tune gene expression by varying promoter strength or engineering RBSs.
It is also possible to balance the flux through the pathway by exploring various 
combinations of enzymatic properties such as catalytic efficiency, cofactor 
specificity, stability, and substrate specificity. Currently, there are several exam- 
ples of pathway libraries constructed through different advanced DNA assembly 
methods. 


16.3.1 In vitro Assembly Methods 


The Gibson assembly method was applied to generate a large combinatorial 
library of promoters and enzymes. The proof-of-concept pathway was the heter- 
ologous acetate utilization pathway in E. coli, comprised of an acetate kinase 
(ackA) and a phosphotransacetylase (pta) [30]. This combinatorial library was 
based on three promoter sequences of assorted strengths and four orthologous
variants of each gene, generating 144 possible unique combinations of the
promoters and genes. Each gene cassette was synthesized with an RBS, a
terminator, and the promoter/gene variant. Unique 40-bp DNA linker sequences at
the terminal ends of each cassette were homologous to the DNA directly upstream
and downstream of the gene (Figure 16.2). These linker regions were used to
ensure the proper order of the pathway during assembly.
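
The figure of 144 combinations follows from three promoters and four orthologs at each of the two gene positions. A short sketch of the enumeration is given below; the part names (P1, ackA_1, and so on) are placeholders invented for illustration, not identifiers from the study.

```python
from itertools import product

# Placeholder part names: three promoters and four orthologs per gene.
promoters = ["P1", "P2", "P3"]
ackA_orthologs = [f"ackA_{i}" for i in range(1, 5)]
pta_orthologs = [f"pta_{i}" for i in range(1, 5)]

# One design = (promoter for ackA, ackA ortholog, promoter for pta, pta ortholog).
library = list(product(promoters, ackA_orthologs, promoters, pta_orthologs))
library_size = len(library)  # (3 * 4) * (3 * 4) = 144
```

Listing the designs explicitly, rather than only counting them, makes it straightforward to match sequenced clones back to a position in the design space.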

The total library size was approximately 10⁴, affording 70-fold coverage of the
144 possible combinations. Investigation of the assembly efficiency showed that 
over 80% (30/37) of the selected clones harbored a correctly assembled pathway. 
Further sequencing analyses showed that of the thirty correctly assembled path- 
ways, 60% (18/30) had recognizable promoter sequences. Of the possible 144 
promoter/gene combinations, 14 unique combinations were present in the 18 
positively identified pathways. A bias was noted toward a specific combination of 
genes from certain organisms, even though each gene fragment was assembled 
in equal combinations. This bias could have been the result of an assembly bias, 
or it could be the result of a screening bias, as the library was screened on acetate 
and these genes could be the most efficient for acetate utilization in E. coli. 
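
One way to put the observed diversity in context is to ask how many distinct combinations would be expected if clones were drawn uniformly at random from the 144 designs. The sketch below computes that expectation. Uniform sampling is an assumption made here for illustration (the clones in the study were obtained after growth on acetate), and the function name is hypothetical.

```python
def expected_unique(n_variants, n_clones):
    """Expected number of distinct variants among n_clones drawn uniformly
    at random, with replacement, from n_variants equally likely designs."""
    return n_variants * (1 - (1 - 1 / n_variants) ** n_clones)


# 18 identified pathways drawn from 144 possible combinations.
unique_if_uniform = expected_unique(144, 18)  # about 17

# Expected fraction of the 144 designs present in a library of 10**4 clones.
coverage = expected_unique(144, 10**4) / 144
```

Under this assumption about 17 of the 18 clones would be expected to be distinct, so observing 14 is in line with the bias noted above, although a sample of 18 is too small to separate assembly bias from screening bias.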

[Figure 16.2 panels: (1) Gibson assembly enzyme and promoter library: promoters,
ackA, and pta joined by 40-bp linkers. (2) DNA assembler promoter library:
promoter and terminator variants for BGL and CDT with 400-bp homologous regions.
(3) DNA assembler enzyme library: XR, XDH, and XKS genes with promoters and
terminators and 400-bp homologous regions. (4) Error-prone enzyme library: BGL
and CDT variants, with the BGL terminator and CDT promoter carrying no
mutations, joined by 40-bp homologous regions.]

Figure 16.2 Preparation of DNA fragments for large library generation. Each unique design represents a unique
promoter or gene. Varied strength promoters, orthologous genes, or mutated pathway components generate
diversity. If the DNA is assembled with homology regions, upstream and downstream of the DNA fragment of
interest, the pathway can assemble properly into many different combinations. Each strategy has incorporated
different lengths of homology, which can contribute to the efficiency of correct assembly. These DNA fragments
are then subjected to the desired DNA assembly reaction with the linearized vector and transformed into the
host.

The Gibson assembly was also used by Coussement and coworkers in another
example of creating a combinatorial library of transcription, translation, and
protein sequence variability [31]. This strategy utilized a single-stranded
assembly to introduce diversity in the double-stranded DNA of the promoter, RBS,
and/or coding sequences. Optimization of the assembly found that two
oligonucleotide fragments of similar lengths provided a nearly 100% efficiency
of assembly. More DNA fragments or fragments of different lengths lowered the
assembly efficiency. Promoter, RBS, and protein libraries using a single gene
were all proven to have a large linear range and had diverse expression and
activity. The assembly was tested for combinatorial pathway libraries using the
reporter genes mKate2 and sfGfp. The promoter library was tested using this
fluorescent pathway. One hundred and eighty-eight clones were randomly 
picked and profiled for complete representation of the potential expression 
landscape. Theoretical library size was 4°” and these 188 clones revealed a good 
representation of diversity. This pathway optimization is based on the assembly
of short fragments of 50-150 bp and has not been applied to larger DNA
fragments. The strategy is efficient for the shorter promoter regions of E. coli
and for the point mutations used in targeted protein engineering, one of the
preferred strategies for protein engineering. However, pathways that incorporate
diversity in larger DNA fragments, such as yeast promoters or libraries from
other protein engineering strategies, cannot be built with the current method.

The Gibson assembly was also utilized by Lee et al. to optimize a multienzyme 
pathway in the absence of a high-throughput assay [32]. This study took
pathway library assembly one step further and incorporated computational
modeling to reduce the large library that must be screened. For assembly,
standardized vectors were constructed based on the principles of BglBrick-style
cloning of protein fusions. The expression cassettes were flanked by pairs of
homology sequences (20 bp) derived from yeast barcodes to allow for correct
sequence assembly. Each promoter used was shown to work independently of the
DNA sequence directly downstream of it. Three-gene library assemblies resulted
in 25-33% misassembly. Library assembly was tested in a three-gene fluorescent
protein library, with a theoretical library size of 125. The triple library was
shown to cover the complete three-dimensional expression space. 

To apply this assembly to a pathway and construct a predictive model, the
five-gene violacein biosynthetic pathway was utilized, resulting in a
theoretical combinatorial library size of 3125. Ninety-one random transformants
from the library were characterized for genotypic and phenotypic data. A linear
regression model was then constructed from these data and used to predict
optimal phenotypes. The authors suggest that a low sampling rate of 1-2% of the
library could be sufficient for generating a predictive model. Four models were
constructed for different intermediates and branched products of the violacein
pathway. Agreement between the model predictions and empirical data was high,
with Pearson correlation coefficients between 0.77 and 0.92 for the specific
targets. The model was used to predict the top five expression-level
combinations. These combinations were individually cloned and tested to
determine whether the predicted expression levels increased production of the
desired product. The model was able to identify the expression levels that
yield the highest production of the desired product from the pathway.
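
The sample-then-model workflow can be illustrated with a small, self-contained sketch: fit an ordinary least-squares model to a sparse sample of the 3125 designs, then rank every design by predicted output. Everything numeric below is invented for illustration. The ordinal coding of five expression levels as 0-4, the "true" coefficients, and the noise level are assumptions, and the details of the model used in the study are not given in this chapter.

```python
import itertools
import random


def fit_least_squares(X, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, y))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coeffs, levels):
    """Intercept plus one additive term per gene."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], levels))


rng = random.Random(0)
all_designs = list(itertools.product(range(5), repeat=5))  # 5**5 = 3125

# Invented coefficients, used only to generate toy measurements.
true_coeffs = [2.0, 1.0, -0.5, 0.8, -1.2, 0.3]

sampled = rng.sample(all_designs, 91)  # sparse sample of the design space
X = [[1.0, *design] for design in sampled]
y = [predict(true_coeffs, design) + rng.gauss(0, 0.1) for design in sampled]

coeffs = fit_least_squares(X, y)
ranked = sorted(all_designs, key=lambda d: predict(coeffs, d), reverse=True)
top_five = ranked[:5]  # candidates to clone and test individually
```

A purely additive model like this one cannot capture interactions between genes, which is one reason the predicted top designs still need to be built and measured.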

A BioBrick-like assembly strategy was used in a combinatorial library of
engineered RBSs [33]. This iterative assembly process utilizes a chloramphenicol
resistance cassette paired with the library of RBS sequences. The resistance
cassette is flanked by restriction sites and can easily be removed to
incorporate the next target gene and RBS library. To determine whether the
strategy could yield a library that spanned a multidimensional expression
space, three reporter genes were used in a synthetic operon: CFP, YFP, and
mCherry. The RBS modulation was shown to span 100-fold in each dimension of the
expression space. The seven-gene carotenoid biosynthesis pathway with the end
product of astaxanthin was used as a proof-of-concept study for this assembly
in pathways. The theoretical library was 6⁷ possible RBS combinations, and
nearly 25000 clones were visually screened, which is only about 10% of the
potential library. Based on visual screening of colony color, which reflects
astaxanthin content, 500 colonies were picked for further analysis. Fifty
clones were identified as having the most intense color and were screened for
the highest astaxanthin production through high-performance liquid
chromatography (HPLC). This strategy yielded a clone with fourfold higher
astaxanthin production than the wild-type pathway.
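
The relationship between library size and screening depth in this example can be checked with a few lines of arithmetic. The sketch assumes six RBS variants at each of the seven genes, which is consistent with the roughly 10% screening fraction quoted above; the helper name is hypothetical.

```python
def theoretical_library_size(variants_per_position):
    """Multiply the number of variants available at each position."""
    size = 1
    for n in variants_per_position:
        size *= n
    return size


# Assumed: six RBS variants at each of seven genes.
rbs_library = theoretical_library_size([6] * 7)  # 279936 combinations
fraction_screened = 25000 / rbs_library          # a little under 0.09
```

The same helper accepts unequal numbers of variants per position, for example `theoretical_library_size([3, 4, 3, 4])` for a library with three promoters and four orthologs at each of two genes.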

The aforementioned studies have all involved random, large pathway libraries. 
A new BioBrick standard platform, the ePathBrick system, allows for assembly of 
specific pathways, with the ability to vary specific components [34]. The ePath- 
Brick system is a pathway fine-tuning toolkit that consists of a number of 
BioBrick-compatible plasmids with characterized regulatory signal elements. 
With this system, Xu and coworkers demonstrated a modular engineering 
approach for significant titer improvement of a multi-gene fatty acid metabolic 
pathway by fine-tuning gene expression through plasmid copy number and RBS 
engineering [35]. The E. coli fatty acid biosynthetic pathway was apportioned 
and overexpressed in three separate modules. These modules were successfully 
expressed on compatible ePathBrick vectors with varying plasmid copy num- 
bers. The total fatty acid production was optimized by overexpressing each mod- 
ule on high, medium, or low copy number plasmids. Nine independent pathways 
were constructed through the ePathBrick standards and analyzed for fatty acid 
production. As has been noted before, the highest gene expression does not
always give the highest product titer [6]. The greatest increase in fatty acid production
occurred only when the final module was expressed highly, combined with a 
lower expression in the other modules. The balanced gene expression pathway 
produced a fourfold increase in fatty acid titer compared to the lowest-producing 
pathway. Similarly, three different strength RBSs were also tested in the modules, 
and a balance between strong and medium strength RBSs improved fatty acid 
production by twofold. This type of strategy can illuminate bottlenecks in the 
pathway. This study exemplified the importance of high concentrations of malo- 
nyl-CoA in fatty acid production. 

A randomized BioBrick strategy has also been developed, which combines the 
power of Gibson assembly and the modularity of the BioBrick standards [36]. In 
this method, all promoters, RBSs, and transcriptional terminators were
randomized within the pathway. These modular DNA fragments were derived from
PCR-amplified BioBricks, and each component was cloned with 18-28-bp linkers
homologous to the adjacent 5’ and 3’ DNA. Three promoters, three RBSs, and
three terminators were simultaneously randomized for each gene of the
three-gene lycopene biosynthetic pathway, generating a library of nearly
20000 unique clones. The library was assembled through Gibson assembly and
screened on plates for the orange-colored lycopene product. Of the red-orange
colonies, 12 were selected, and DNA sequencing analysis demonstrated that 7/8
randomized pathways were distinct and four pathways had deletions. The study
cautions that the metabolic burden placed on the cells during library
screening could have caused the mutations.




16.3.2 In vivo Assembly Methods 


E. coli does not have robust and efficient homologous recombination machinery;
therefore, in vitro assembly methods are needed. In contrast, plants and
yeast have very efficient homologous recombination machinery, allowing for
facile pathway library creation in vivo. Two different strategies for in vivo
homologous recombination have been developed: chromosomal integration and
plasmid assembly.


16.3.2.1 In vivo Chromosomal Integration

Wingler and Cornish established a reiterative recombination method for the in 
vivo assembly of multi-gene pathway libraries directly into the chromosome [37]. 
The strategy utilized a pair of alternating orthogonal endonucleases and selecta- 
ble markers. Homologous recombination and gap repair were used to construct 
a plasmid containing the gene of interest, marker, and endonuclease, which were 
recombined into an acceptor strain. This acceptor strain carries a predefined 
target locus for integration into the chromosome. Galactose-induced expression 
of the endonuclease cleaves the double-stranded DNA, triggering the homolo- 
gous recombination and leading to integration of the gene of interest and the 
auxotrophic marker into the chromosome. The strains are then selected for the 
new auxotrophic marker and cured against excess donor plasmid. The proof of 
concept for pathway integration and mock library assembly was demonstrated 
using the lycopene biosynthetic pathway (crtE, crtB, and crtI). A large library of
over 10⁴ members was assembled; the mock library contained various ratios of crtB and
crtI alleles that contained either nonsense or silent mutations, which would pro- 
duce working or interrupted pathways. The diversity could be judged based on 
the actual and theoretical percentages of working pathways versus interrupted 
pathways, visualized based on the color of the colonies on the plate. Each library 
had the expected percentage of working pathways, indicating a non-biased 
library assembly into the chromosome. 
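
The logic of the mock library test can be sketched as follows. If alleles with silent mutations give a working pathway and alleles with nonsense mutations interrupt it, and if alleles at each position are incorporated independently and without bias, the expected fraction of working pathways is the product of the functional-allele fractions. The allele ratios and function names below are hypothetical and are not the ratios used in the study.

```python
import random


def expected_working_fraction(functional_fractions):
    """Product of the functional-allele fractions at each gene position."""
    expected = 1.0
    for fraction in functional_fractions:
        expected *= fraction
    return expected


def simulate_mock_library(functional_fractions, n_colonies, seed=0):
    """Simulate unbiased assembly and return the observed working fraction."""
    rng = random.Random(seed)
    working = 0
    for _ in range(n_colonies):
        # A colony is colored only if every position received a functional allele.
        if all(rng.random() < f for f in functional_fractions):
            working += 1
    return working / n_colonies


# Hypothetical mix: 50% functional crtB alleles, 10% functional crtI alleles.
mix = [0.5, 0.1]
expected = expected_working_fraction(mix)              # 0.05
observed = simulate_mock_library(mix, n_colonies=10_000)
```

A marked gap between the observed and expected fractions would point to biased incorporation of one allele, which is the signal the mock library was designed to detect.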

Pathway library strategies have also been established in plant biotechnology to 
study secondary metabolites [38, 39]. Engineering secondary metabolism in 
plants can be a daunting task considering the complexity of the target pathways, 
which could have multiple branches, multifunctional and/or compartmentalized 
enzymes, and complex feedback inhibition. Zhu et al. established a novel method 
for the combinatorial nuclear transformation of multiple genes into a plant, gen- 
erating a pathway library to simplify the study of multiple variables of secondary 
metabolites [38, 39]. Carotenoid production in cereal grains was used as a proof 
of concept. Embryos of the cereal-grain white maize were bombarded with metal 
particles coated with six unique constructs, consisting of a selection marker and 
five carotenogenic genes. The resultant library consisted of any combination of 
one or more expression phenotype from any of the five genes. This method of 
multiple gene transformation and pathway library screening allowed the identi- 
fication of rate-limiting steps in the carotenogenic pathway. Total carotenoid 
production in cereal grains was improved 140-fold based on a unique combina- 
tion identified from this multi-gene pathway library strategy. 

16.3.2.2 In vivo Plasmid Assembly and One-Step Optimization Libraries

Chromosomal integration has been successful in pathway library creation, but
assembling the pathways into a plasmid is also advantageous. A plasmid is a
DNA molecule that can be easily transferred between strains, which is an
important characteristic when excluding the possibility that the observed
improvements are a result of off-target genome modification.

One plasmid-based pathway library was constructed by the DNA assembler
method and consisted of a combinatorial library of promoters of different
strengths for all the genes within the pathway [40]. As a proof of concept in path-
way library generation, the xylose and cellobiose utilization pathways for ethanol 
production were optimized. Efficient utilization of these biomass sugars is criti- 
cal for economically feasible biofuel production. Promoters PDC1, ENO2, and 
TEF1 were mutagenized through nucleotide analog-based error-prone PCR to 
induce a very high mutation rate and produce promoters of various strengths. 
After mutagenesis, mutants for each promoter were assayed through fluores- 
cence protein expression, and 10 promoters of defined strengths were selected 
for library construction. These 10 promoters in each position of the library 
resulted in a theoretical library size of 10² and 10³ for the cellobiose and xylose
utilization libraries, respectively. Each mutant promoter was cloned into a 
helper plasmid that contained 400-bp sequences homologous to the 5’ DNA 
region (Figure 16.2). The mutant promoter/gene expression cassettes were co- 
transformed into a yeast strain with a total library size of 10°. To confirm the 
diversity of the library, over 40 individual colonies from each library were
selected using an antibiotic marker for plasmid-pathway assembly rather than
on the basis of sugar utilization. Each colony from this plasmid marker selection
exhibited a unique growth curve on its respective carbon source, which was 
indicative of a diverse library. 

Improved sugar utilization was visualized in a high-throughput manner by 
inspection of colony size on agar plates, wherein larger colony sizes were sugges- 
tive of faster sugar utilization and improved growth. In the xylose utilization 
pathway, a very efficient mutant pathway was identified in a single step. This 
pathway conferred a xylose consumption rate of 0.73 g L⁻¹ h⁻¹, comparable with
some of the fastest xylose consumption rates from strains that had been sub- 
jected to multiple generations of optimization strategies. The strain harboring 
the wild-type pathway did not produce any ethanol, while the mutant pathway 
conferred an ethanol productivity of 0.17 g L⁻¹ h⁻¹. In the cellobiose utilization
strategy, the strain harboring the optimized pathway yielded a 5.4-fold improved 
cellobiose utilization rate and a 5.3-fold increase in ethanol productivity. 

A similar pathway library strategy created a combinatorial library of
homologous enzymes of the xylose utilization pathway, with fixed-strength promoters [41].
The fungal xylose utilization pathway has been shown to be especially sensitive 
to cofactor imbalances and unbalanced enzyme expression [42-45]. A total the- 
oretical library size of 8360 possible unique combinations of homologous 
enzymes for each of the five genes in the pathway was constructed through 
homologous recombination. Each enzyme was characterized to show varied 
activities and cofactor dependencies. The gene sequences were cloned into 
helper plasmid expression cassettes, containing promoters, terminators, and at 
least a 400-bp region homologous to the 5’ and 3’ DNA regions of the pathway at
the termini of the expression cassette (Figure 16.2). The expression cassettes
were transformed into three different yeast strains with an average library
size of 1.3 × 10⁴. To confirm library diversity and screen for optimal pathways,
the same strategies established in the promoter-based library were applied [40].
Sequencing results of random colonies showed that all the genes were recogniz- 
able with no major mutations or hybrids, resulting in a 100% efficiency, and there 
was no significant bias toward a certain gene. The same library was screened in 
three different strains, and a unique combination of genes was discovered to be 
optimal in each individual strain. This unique combination for each strain is 
attributed to the different metabolic background of the strains and availability of 
precursors or cofactors. 


16.3.2.3 In vivo Plasmid Assembly and Iterative Multi-step Optimization Libraries

Directed evolution, an iterative multistep optimization strategy, is an
established and very powerful technique in synthetic biology for optimizing
protein activity [5, 46]. Its application has been expanded to include
pathway-scale transcriptional engineering and protein engineering in the
following pathway library studies. Directed evolution on the pathway scale is
particularly powerful because it allows the optimal flux to be identified with
no a priori information about pathway bottlenecks or specifics about the
pathway enzymes. It also allows all components to be screened or selected for
balanced activity, not just for high activity.
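
The iterative mutate, screen, and select cycle can be sketched in a few lines. The fitness function below is a toy stand-in that rewards balanced activity of two pathway steps; it, the starting genotype, the mutation step, and the library size are all invented for illustration and do not model any of the studies that follow.

```python
import random


def fitness(genotype):
    """Toy pathway output: limited by the weaker of two steps, with a small
    cost for total expression (hypothetical, for illustration only)."""
    step_one, step_two = genotype
    return min(step_one, step_two) - 0.05 * (step_one + step_two)


def mutate(genotype, rng, step=0.5):
    """Randomly perturb each component, keeping values non-negative."""
    return tuple(max(0.0, g + rng.uniform(-step, step)) for g in genotype)


def directed_evolution(parent, rounds=3, library_size=200, seed=0):
    """Run rounds of library generation, screening, and selection."""
    rng = random.Random(seed)
    history = [fitness(parent)]
    for _ in range(rounds):
        library = [mutate(parent, rng) for _ in range(library_size)]
        library.append(parent)  # keep the parent so fitness never decreases
        parent = max(library, key=fitness)  # screening: pick the best clone
        history.append(fitness(parent))
    return parent, history


# Hypothetical, strongly unbalanced starting pathway.
best, history = directed_evolution(parent=(10.0, 1.0))
```

Because the toy fitness is limited by the weaker step, selection in this sketch favors raising the low component rather than the high one, which mirrors the idea of screening for balanced rather than maximal activity.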

Yuan and coworkers applied directed evolution to mutant promoter path- 
way libraries of the cellobiose utilization [47]. An average mutation rate of 
12-16-nucleotide substitutions per kilobase for each mutagenized promoter was 
obtained. The pathway genes were not mutagenized, and these non-mutated 
DNA fragments were co-transformed with the error-prone promoter library and 
a linearized vector for a total library size of 10*. The pathway phenotype improve- 
ment was assessed by fast sugar utilization, visualized by large colonies on agar 
plates. The first round of directed evolution identified a strain with a 5.7-fold
increase in cellobiose consumption rate and a 5.5-fold increase in ethanol
productivity. Further rounds of evolution yielded incremental
increases (Figure 16.3). Characterization of the mutant promoters showed
that the expression level ratios had changed significantly. While the parent
BGL:CDT (β-glucosidase:cellodextrin transporter) relative expression ratio was
13.8:1, the first round of mutagenesis altered the ratio to 2.5:1. This substantial
increase in relative CDT expression suggested that expression of this protein was a
bottleneck.
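
As a quick check of the arithmetic behind the bottleneck argument, the change in the BGL:CDT ratio can be converted into a fold change in CDT expression relative to BGL. The sketch below uses only the two ratios reported above; the helper name is illustrative.

```python
def relative_cdt_fold_change(parent_ratio, evolved_ratio):
    """Fold increase in CDT expression relative to BGL when the BGL:CDT ratio drops."""
    return parent_ratio / evolved_ratio

if __name__ == "__main__":
    fold = relative_cdt_fold_change(13.8, 2.5)
    print(f"Relative CDT expression rose about {fold:.1f}-fold")
```

A drop from 13.8:1 to 2.5:1 therefore corresponds to roughly a 5.5-fold increase in CDT expression relative to BGL.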

Pathway-scale protein engineering strategies were also applied using homologous
recombination [48]. In this study, both the BGL and CDT proteins were
coevolved for balanced activity in a directed evolution manner. One amino acid
substitution per protein was introduced through error-prone PCR, yielding a
theoretical total library size of 9.9x10°. No gene expression elements were
mutagenized in this strategy; they were therefore PCR amplified into the pathway
separately (Figure 16.2). The total library size screened was 10*, and the library
was screened for strains harboring a pathway that conferred fast growth on cellobiose,
visualized through large colonies on cellobiose agar plates. In this study, two
rounds of directed evolution identified a mutant pathway that conferred a 47%
increase in growth rate on cellobiose and a 64% increase in ethanol productivity
(Figure 16.3). As all proteins of the pathway were coevolved, mutations were
found in each protein from every round and characterized to understand why
the pathway conferred an improved phenotype. The BGL mutants were shown
to have improved cellobiose specificity and activity. The CDT mutants had an
overall higher activity, associated with a higher Vmax.

Figure 16.3 Fermentation profiles of the evolutionary rounds for the pathway libraries. (a,b) Cellobiose
consumption and ethanol production of the cellobiose utilization pathway from the promoter-based directed
evolution. The black square represents the parent pathway with no mutations in the PDC1 and ENO2 promoters.
The circles are the first round of error-prone PCR of both promoters. The triangles represent the second and final
rounds of directed evolution mutagenesis. (c,d) Cellobiose consumption and ethanol production of the
cellobiose utilization pathway from the protein-based directed evolution. The black square represents the
wild-type pathway with no mutations in the β-glucosidase and the cellodextrin transporter. The circle is the first
round of error-prone PCR of both proteins. The triangle represents the second round of directed evolution.


16.4 Conclusions and Prospects 


Advanced DNA assembly methods have allowed scientists and engineers 
extraordinary freedom in constructing pathways, greatly facilitating advances in 
pathway library generation. Pathway optimization through whole pathway librar- 
ies has expanded the potential diversity and possibilities for improving pathway 
phenotype. Furthermore, high efficiency and modularity of these advanced DNA 
assembly methods make in silico design [49] and automated assembly [50] of 
these libraries possible. Large combinations of library components can be indi- 
vidually constructed by robotic platforms and investigated by high-throughput 
screening for extensive investigations of improved pathway phenotypes. Despite
the rapid progress of DNA assembly technologies, widespread application of
pathway libraries is currently limited by the throughput of available screening
methods. Without the ability to easily and economically quantify the phenotype of
interest, these large-scale pathway libraries will not be able to fulfill their maximum potential. Future
high-throughput screening methods could be realized through microfluidic 
devices, with the ability to screen up to 10° clones per day [51, 52]. Biosensors 
also have potential in high-throughput screening, as shown by a number of tran- 
scription factor-based biosensors that have been engineered to detect small mol- 
ecules. These biosensors can link the small molecule concentration to an easily 
measurable signal such as fluorescence and cell growth via gene circuits [53-56]. 
Though challenges remain, advanced DNA assembly methods for creating pathway
libraries have considerable potential to improve microbial production of fuels
and chemicals, and future pathway engineering methods will benefit from these
strategies.


Definitions 


Pathway: Coordinated heterologous and/or endogenous enzymatic reactions

Pathway engineering: A research area that specializes in modifying or optimizing
components of an enzymatic pathway for improved phenotype

Pathway optimization: Strategies to improve the overall performance of an
enzymatic pathway

Pathway libraries: A collection of mutant enzymatic pathways wherein multiple
components (RBS, promoters, enzymes) within the pathway have simultaneously
been mutated

Directed evolution: An evolutionary process for engineering biological systems
that mimics Darwinian evolution in vitro and in vivo: rounds of random mutations
are incorporated into the DNA sequence and selected for improved phenotype
in an iterative fashion

DNA assembly: The process to conjoin several DNA fragments to create a large
DNA molecule
17.1 The Need for a New Therapeutic Paradigm


The advent of the germ theory of disease in the late nineteenth century marked a 
watershed in the history of medicine and heralded the development of modern 
pharmaceuticals. The work of Louis Pasteur, Robert Koch, and fellow microbiolo- 
gists elucidated the bacterial and viral origins of common and often fatal diseases 
such as cholera and puerperal fever and motivated the development of myriad small 
molecule-based pharmaceuticals and viral vaccines that specifically targeted infec- 
tious agents. As epidemics such as smallpox and polio came under control, new 
classes of diseases that do not have simple biological causes gradually took center 
stage. Chronic and complex illnesses such as diabetes, cardiovascular diseases, and 
cancers supplanted infectious diseases as the dominant scourges in developed 
countries after the Second World War. In response to this changing landscape of 
medical challenges, pioneers in molecular biology and genetic engineering launched 
a new paradigm of pharmaceutical development and, beginning in the 1980s, pro- 
duced the first biologics: monoclonal antibodies such as trastuzumab (Herceptin) 
[5, 6] and recombinant protein therapeutics such as synthetic insulin and erythro- 
poietin [7, 8]. Today, the twin pillars of small molecules and biologics continue to 
serve as the pharmaceutical arsenal of modern medicine, complemented by non- 
biochemical methods such as medical devices and surgical intervention. 

Despite modern advancements in diagnostics and therapeutics, several debili- 
tating diseases have remained essentially incurable. In particular, cancer has 
steadily risen through the ranks of fatal diseases over the past several decades, 
with prominent examples including pancreatic and small cell lung cancers, each 
with an overall relative 5-year survival rate of 7% [9]. Glioblastoma, the most 
common type of primary brain tumor, has a median survival period of less than
15 months [10, 11]. Unlike well-characterized infectious diseases and metabolic 
disorders such as diabetes, the conditions highlighted previously do not present 
simple biological causes or deficiencies that can be easily eliminated or compensated
by chemical drugs and biologics. Cancer cells are characterized by genomic
instabilities that enable them to escape individual therapeutic strategies through
genetic hypermutation [12]. Furthermore, in cases such as glioblastoma, diseased
cells can be situated in a protected niche (e.g., behind the blood-brain barrier
(BBB)) that is both inaccessible to chemical and biological therapeutics and
incompatible with complete surgical resection [13]. Finally, diseased cells often
closely resemble healthy tissues on the surface and lack unique molecular markers
that allow precise identification by drug molecules. As a result, strategies
including chemotherapy and antibody therapeutics often lead to severe off-target
or “on-target, off-tumor” toxicities [5, 14].

Table 17.1 Major categories of cell-based immunotherapies and application areas currently
under investigation.

Cell type                          Major application areas
Effector and memory T cells        Cancers, viral infections
Regulatory T cells                 Autoimmune diseases, inflammatory diseases
Myeloid-derived suppressor cells   Autoimmune diseases, inflammatory diseases
Dendritic cells                    Cancer vaccines
Natural killer cells               Cancers, viral infections

Complex, dynamic diseases call for a new category of therapeutics that can 
actively sense and process multiple input signals and respond to changing 
disease landscapes with multipronged therapeutic outputs [1-4]. Cellular thera- 
pies represent a new platform for the treatment of currently intractable diseases. 
In particular, cell-based immunotherapy has made major strides in the past decade
in the treatment of cancer, viral infections, and autoimmune diseases [15-23]
(Table 17.1). In August 2017, T cells that have been genetically modified to express 
tumor-targeting chimeric antigen receptors (CARs) became the first gene therapy 
to gain approval from the U.S. Food and Drug Administration (FDA) for cancer 
treatment, highlighting the potential of cellular therapy as a novel treatment 
option for advanced malignancies. 


17.2 Rationale for Cellular Therapies 


Cellular therapies — that is, the use of living cells as the therapeutic agent — have a 
number of distinctive properties that are well suited to the treatment of complex, 
dynamic diseases. First, mobile living cells are significantly more versatile than 
single molecules in the type and number of effector functions that can be exe- 
cuted. Cellular therapeutics can be engineered to serve both as independent 
actors that directly eradicate diseased cells or infectious agents and as payload 
carriers that deliver therapeutic molecules to a targeted site. For example, cyto- 
toxic T cells expressing surface-bound receptors that direct T cells toward tumor 
antigens have shown clinical efficacy in treating melanoma [24, 25] and B-cell 
leukemia [26-29] through direct killing of cancer cells. Antitumor functions can 
be further enhanced by cellular engineering, such as decorating T-cell surfaces 
with nanoparticles to specifically deliver drug molecules to the immunological 
synapse [30, 31] or stably integrating T cells with DNA constructs that encode
immunostimulatory cytokines under the control of constitutive or inducible
promoters [32, 33]. Similarly, genetically engineered stem cells have been pro- 
grammed to deliver cytotoxic molecules, angiostatic factors, and immunostimu- 
latory cytokines to tumor cells [34-36], demonstrating the versatility and 
programmability of living cells as therapeutic agents. 

Second, unlike static drug molecules, cellular therapeutics can be genetically 
programmed to conditionally and dynamically deliver functional outputs in 
response to the presence of specific inputs, thereby increasing therapeutic speci- 
ficity and efficacy. For example, T cells are naturally programmed to execute 
functions ranging from cytotoxicity to immune recruitment only upon encoun- 
tering target cells that express antigens recognized by the T-cell receptor (TCR). 
T-cell functions vary dynamically with time and are closely coordinated with the 
rest of the adaptive immune system, thus enabling a finely modulated response 
to disease and infection. In addition to natural TCRs, synthetic CARs that mimic 
TCR function and redirect T-cell specificity toward disease targets that are oth- 
erwise non-immunogenic have shown great promise in clinical trials [26-29]. 
Furthermore, T-cell activation can in turn serve as the trigger for downstream 
effector outputs. For example, by transgenically expressing the immunostimula- 
tory cytokine interleukin-12 (IL-12) gene under the NFAT (nuclear factor of acti- 
vated T cells) promoter, researchers have generated melanoma-reactive T cells 
that produce IL-12 only upon T-cell activation, thus avoiding the need for sys- 
temic IL-12 injections and associated toxicities [33]. As living entities, therapeu- 
tic cells have the ability to perform sense-and-respond functions that greatly 
enhance treatment specificity and reduce toxic side effects. 

Third, unlike chemical pharmaceuticals and biologics, cellular therapeutics have 
the potential to establish prolonged proliferation in the patient and provide con- 
tinual surveillance against disease relapse without repeated drug administration. 
Long-term persistence of therapeutic cells has been shown to be critical in main- 
taining complete remission across cancer types in adoptive T-cell therapy [37, 38], 
highlighting the importance of this unique characteristic of cellular therapies. 

Despite these important advantages, cellular therapeutics still face major chal- 
lenges in achieving the level of safety and efficacy required of frontline treatment 
options. The use of living cells as therapeutic agents invokes a level of complexity 
not previously seen with traditional pharmaceutical development, and the ability 
to precisely engineer and stringently regulate therapeutic cells is a critical need 
that must be fulfilled in the rise of cellular therapy. The following sections discuss 
some of the challenges facing cell-based therapeutics — particularly cell-based 
immunotherapies — and highlight solutions that have been developed through 
the application of synthetic biology. 


17.3 Synthetic Biology Approaches to Cellular
Immunotherapy Engineering 


The programmability of living cells to perform diverse functions (natural or
engineered, constitutive or modulated by regulatory systems) is a defining
characteristic and significant advantage of cellular therapies. Viewing cells as
chassis, synthetic biologists have demonstrated that biological functions can be
rationally designed, systematically optimized, and translated across organisms 
[39]. A core competency of synthetic biology is the rapid construction, integra- 
tion, and characterization of biological systems, leading to well-defined, ration- 
ally engineered cell products. This approach to cell engineering has generated 
early examples with potential therapeutic functions and converges with work 
that has been well established in the field of cellular therapeutics [40-43].

Immune system engineering has played a dominant role in cellular therapies. 
The application of cell-based immunotherapy can be broadly divided into two 
categories: immunosuppressive and immunostimulatory. Immunosuppressive 
therapies aim to dampen aberrant immune responses that characterize inflam- 
matory and autoimmune diseases such as multiple sclerosis, inflammatory bowel 
diseases, and organ transplant rejection [44, 45]. For example, regulatory T cells 
and myeloid-derived suppressor cells are naturally immunosuppressive cell types 
under intensive investigation as treatment options for conditions ranging from 
ocular inflammation to stroke-induced cerebral ischemia [46, 47]. In contrast, 
immunostimulatory therapies aim to boost immune responses against infectious 
agents and tumor growths. Prominent examples in this category include the use 
of natural killer (NK) cells and cytotoxic T cells that directly kill diseased cells, as 
well as dendritic cells that stimulate immune responses by presenting disease- 
associated antigen peptides to effector cells including T cells and B cells [48-50]. 
The engineering of immune cells provides ample opportunity for synthetic biol- 
ogy to make a real impact on the improvement of health and medicine. 


17.3.1 CAR Engineering for Adoptive T-Cell Therapy 


Adoptive T-cell therapy is an emerging treatment paradigm in which T cells 
expressing either TCRs or CARs that target specific disease markers are expanded 
ex vivo prior to infusion into a patient (Figure 17.1). These systemically adminis- 
tered T cells have the ability to seek and destroy target cells that display the cog- 
nate antigen, thereby serving as a living drug against otherwise intractable 
diseases such as refractory cancers and posttransplantation viral infections 
[51, 52]. In particular, the adoptive transfer of T cells that express anti-CD19 
CARs has shown remarkable curative potential against advanced B-cell malig- 
nancies, achieving up to 90% complete remission rate in the treatment of acute 
lymphoblastic leukemia [27, 28, 53]. 

The development of CARs offers an example of a synthetic biological approach to
efficient cell therapy engineering. CARs are synthetic receptors that redirect T-cell
specificity toward diseased targets, such as virally infected or cancerous cells, that
do not naturally provoke robust immune responses from endogenous T cells. CARs
are fusion proteins in which antibody-derived single-chain variable fragments
(scFvs) serve as extracellular sensing domains and are fused (via extracellular spacer
sequences and transmembrane domains) to cytoplasmic CD3ζ signaling domains
derived from the natural TCR [54] (Figure 17.2a). When the CAR is expressed by
conventional T cells, ligation of the scFv domain to cognate antigens triggers signaling
through the CD3ζ chain, leading to T-cell activation and unleashing cytotoxic
activity toward the target cell. Second- and third-generation CARs have further
incorporated costimulatory domains such as CD28 and 4-1BB to enhance T-cell
effector functions [55-58]. The modular nature of CARs is highlighted by the diversity
of targets that can be recognized by simply replacing the scFv while retaining
essentially the same transmembrane and cytoplasmic domains [59].

Figure 17.1 Schematic of adoptive T-cell therapy. Endogenous tumor-infiltrating lymphocytes
(TILs) are T cells with natural tumor reactivity and can be isolated from tumor biopsies,
expanded ex vivo, and reinfused into the cancer patient. Alternatively, non-tumor-reactive
T cells can be isolated, genetically modified to express a tumor-reactive T-cell receptor (TCR)
or chimeric antigen receptor (CAR), expanded ex vivo, and reinfused into the patient.

Although the first CARs predate the emergence of synthetic biology as a disci- 
pline, CAR engineering is highly compatible with the synthetic biology approach 
to biological system design and construction. Taking advantage of the modular 
composition of CAR molecules, researchers have systematically probed the rela- 
tionship between CAR structure and T-cell function by characterizing panels of 
related CAR molecules [60], a process that has been greatly facilitated by the 
advent of high-throughput DNA synthesis and assembly techniques. For exam- 
ple, studies using a combinatorially constructed panel of CAR molecules have 
demonstrated that the optimal length of the extracellular spacer in CARs is con- 
tingent upon the size of the antigen presented by the target cell [61]. The effects 
of adding different costimulatory signals and extracellular spacers to CARs have 
also been explored by combinatorial cloning of CAR molecules [57, 62]. 
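
The combinatorial construction of CAR panels described above amounts to enumerating every combination of interchangeable domains. The following is a minimal sketch of that enumeration, not a description of any cited cloning workflow: the specific scFv targets, the three spacer labels, and the `build_panel` helper are hypothetical placeholders, while the domain order (scFv, spacer, transmembrane domain, optional costimulatory domains, CD3ζ) follows the CAR architecture outlined in this section.

```python
from itertools import product

# Illustrative domain choices; the spacer labels are hypothetical placeholders.
SCFVS = ["anti-CD19", "anti-CD20"]
SPACERS = ["short", "medium", "long"]
COSTIM = [(), ("CD28",), ("4-1BB",), ("CD28", "4-1BB")]  # first- to third-generation

def build_panel(scfvs, spacers, costims):
    """Enumerate every scFv/spacer/costimulatory combination as an ordered domain list."""
    panel = []
    for scfv, spacer, costim in product(scfvs, spacers, costims):
        domains = [scfv, f"{spacer} spacer", "transmembrane", *costim, "CD3ζ"]
        panel.append(domains)
    return panel

if __name__ == "__main__":
    panel = build_panel(SCFVS, SPACERS, COSTIM)
    print(len(panel), "constructs")
    print(" - ".join(panel[0]))
```

Even this small set of parts yields 2 x 3 x 4 = 24 constructs, which illustrates why high-throughput synthesis and assembly are helpful for characterizing such panels systematically.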

Beyond elucidating principal design rules, this bottom-up approach to CAR
engineering has been further leveraged to yield receptors that can execute
Boolean logic and regulate T-cell activation according to simultaneous or
sequential antigen encounter. Such computational capability offers a means to
increase the safety and efficacy of adoptive T-cell therapy by addressing critical
clinical challenges, including imperfect targeting specificity and vulnerability
to antigen escape (i.e., a process by which diseased cells escape T-cell detection
by downregulating the expression of targeted antigens).

For example, three recent clinical trials reported that 38-100% of patient
relapses after CD19 CAR-T cell therapy were characterized by the loss of CD19 
expression [28, 63, 64]. To address the problem of antigen escape, bispecific 
OR-gate CARs that incorporate two scFv domains have been developed. T cells 
armed with OR-gate CARs can respond to either of two distinct antigen inputs, 
thus reducing the probability that a tumor cell can successfully escape detection 
via mutational loss of antigen expression [65, 66] (Figure 17.2b). This principle 
has been applied to generate an optimized CD19/CD20 bispecific CAR that ena- 
bles cytotoxic T cells to effectively eliminate cancerous B cells that have lost 
CD19 expression [66, 67]. Specifically, T cells expressing the bispecific CAR are 
able to not only eradicate established lymphoma in mice but also prevent tumor 
relapse, whereas animals treated with conventional, single-input CD19 CAR T 
cells succumb to cancer recurrence caused by antigen escape [66, 67]. Additional 
combinations such as CD19/CD22 are also under active preclinical evaluation 
[68], and they promise to significantly increase the efficacy of CAR-T cell therapy 
against heterogeneous and/or genetically unstable tumors. 
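
The logic of the OR-gate design can be illustrated with a small truth-table sketch. It treats antigen recognition as a simple Boolean, which ignores antigen density, receptor affinity, and signaling strength; the function names and the example cell descriptions are illustrative assumptions.

```python
def single_input_car(antigens, target="CD19"):
    """Conventional CAR: activates only if its single target antigen is present."""
    return target in antigens

def or_gate_car(antigens, targets=("CD19", "CD20")):
    """Bispecific OR-gate CAR: activates if either target antigen is present."""
    return any(t in antigens for t in targets)

if __name__ == "__main__":
    tumor_cells = {
        "parental B-cell tumor": {"CD19", "CD20"},
        "CD19-loss escape variant": {"CD20"},
        "cell with neither antigen": set(),
    }
    for name, antigens in tumor_cells.items():
        print(f"{name}: single-input={single_input_car(antigens)}, "
              f"OR-gate={or_gate_car(antigens)}")
```

In this simplified picture, the CD19-loss variant evades the single-input CAR but is still recognized by the OR-gate CAR, which is the behavior the bispecific design aims for.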




Figure 17.2 CARs redirect T-cell specificity toward tumor targets. (a) Schematic of first-, second-,
and third-generation CARs. The single-chain variable fragment (scFv) derived from a
tumor-antigen-specific antibody serves as the extracellular sensing domain, and the cytoplasmic tail of
the CD3ζ chain serves as the intracellular signaling domain of the CAR. In second- and
third-generation CARs, one or two costimulatory domains such as CD28 and 4-1BB are directly fused
to the CD3ζ chain to enhance T-cell signaling. (b) Schematic of single-chain, bispecific OR-gate
CARs. T cells expressing an OR-gate signal processing system can kill any target cell that
expresses either antigen A or antigen B. (c) Schematic of an AND-NOT-gate CAR pair. The first
receptor is a conventional CAR that targets antigen A. The second is a chimeric inhibitory
receptor (iCAR) that targets antigen B and contains the cytoplasmic domain of an inhibitory
receptor (e.g., PD-1 or CTLA-4). Presence of antigen A triggers CAR signaling, while presence of
antigen B triggers iCAR signaling. The inhibitory function of the iCAR overrides any activation
signal that may result from the conventional CAR, thus executing A-NOT-B signal computation.
(d) Schematic of an AND-gate CAR pair. The first receptor is a conventional first-generation CAR
that targets antigen A and contains only the CD3ζ chain without costimulatory signals. The
second is a chimeric costimulatory receptor that targets antigen B and contains both CD28 and
4-1BB costimulatory signals but no CD3ζ chain. Both antigens must be present to trigger a
sufficiently robust T-cell response to execute therapeutic function. (e) Schematic of a
“remote-controlled” CAR system. Here, the CAR protein is split into two parts, with the first fragment
being a conventional CAR that contains the FK506 binding protein (FKBP) instead of the CD3ζ
chain at the C-terminus. The second fragment consists of a membrane-tethered CD3ζ chain
fused to the FKBP-rapamycin binding (FRB) domain. Presence of a rapamycin analog (rapalog) molecule
triggers dimerization between FKBP and FRB, thereby reconstituting a full CAR protein and
enabling CAR signaling in response to antigen binding. (f) Schematic of a synthetic Notch
(synNotch) receptor-regulated CAR expression system. Upon binding to antigen A, the synNotch
receptor releases a TF, which translocates to the nucleus and triggers CAR expression from a
cognate promoter. This CAR molecule is subsequently able to trigger T-cell activation upon
binding to antigen B, resulting in AND-gate signal computation in a sequential manner.
The modular nature of CAR signaling has also spurred the development of a
number of dual-CAR systems that trigger T-cell activation only if two conditions
are simultaneously satisfied, in essence executing AND-gate or AND-NOT-gate
computations that aim to improve targeting specificity by therapeutic T cells.
In one example, researchers designed inhibitory chimeric antigen receptors
(iCARs) by replacing the CD3ζ domain of a prostate-specific membrane antigen
(PSMA)-targeting CAR with intracellular signaling domains from inhibitory
receptors such as cytotoxic T-lymphocyte-associated protein 4 (CTLA-4) and
programmed death 1 (PD-1) (Figure 17.2c) [69]. Upon recognition of PSMA,
inhibitory signaling through the iCAR effectively competed against activating
signaling by a second-generation CD19 CAR to limit T-cell proliferation,
cytokine secretion, and cytotoxicity, thereby achieving “CD19-AND-NOT-PSMA”
signal computation [69]. It is important to note that NK cells naturally
express both activating and inhibitory receptors, and insights to be gained from
greater understanding of NK cell signaling may also serve to instruct the robust
development of iCARs.
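
The “CD19-AND-NOT-PSMA” computation can be written out as a truth table. The sketch below is an idealized Boolean abstraction, assuming that iCAR signaling fully overrides CAR activation whenever PSMA is present; in practice the inhibition is a competition between signals rather than an absolute veto. The function name is illustrative.

```python
from itertools import product

def and_not_gate(cd19_present, psma_present):
    """CD19-AND-NOT-PSMA: iCAR signaling overrides CAR activation (idealized)."""
    car_activation = cd19_present    # activating CAR recognizes CD19
    icar_inhibition = psma_present   # inhibitory iCAR recognizes PSMA
    return car_activation and not icar_inhibition

if __name__ == "__main__":
    print("CD19  PSMA  T-cell response")
    for cd19, psma in product([False, True], repeat=2):
        print(f"{cd19!s:<5} {psma!s:<5} {and_not_gate(cd19, psma)}")
```

Only the CD19-positive, PSMA-negative row yields a response, which is how the second antigen is used to mark cells that should be spared.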

Building on the clinical observation that both the CD3ζ chain and costimulatory
signals are necessary to achieve in vivo antitumor responses, another study
described a dual-receptor system in which the first receptor targets the prostate
stem cell antigen (PSCA) and contains only the CD3ζ chain without costimulatory
signals, while the second is a chimeric costimulatory receptor that targets
PSMA and contains both CD28 and 4-1BB costimulatory signals but no CD3ζ
chain (Figure 17.2d). After testing several anti-PSCA scFv domains with varying
binding affinities, researchers were able to generate a pair of receptors that
trigger T-cell activation and effectively control tumor growth in vivo if and only
if the tumor expressed both PSCA and PSMA [70]. An interesting alternative
approach is to segregate the extracellular scFv from the CD3ζ chain until
reconstitution via small molecule-induced heterodimerization. A recent study
reported the construction of ON-switch CARs by incorporating the rapamycin
analog (rapalog)-inducible heterodimerization domain FK506 binding protein
(FKBP) into a truncated second-generation CD19 CAR lacking the CD3ζ chain.
Separately, the FKBP-rapamycin binding (FRB) domain was fused to the cytoplasmic
portion of CD3ζ (Figure 17.2e) [71]. As such, a fully functional CAR
containing the ligand-binding scFv domain and the T-cell-activating CD3ζ chain
is only generated upon the addition of rapalog, which induces dimerization
between FKBP and FRB, thus bringing the two system components into close
proximity. T cells expressing the ON-switch CAR were able to proliferate and
mediate cytotoxicity upon target-cell encounter, but only in a rapalog
dose-dependent manner, thus yielding temporal control over CAR activation via small
molecule drug administration [71].
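
Both designs in this paragraph reduce, in an idealized view, to a two-input AND condition: two antigens in the split-signal system, and one antigen plus a small molecule in the ON-switch system. The sketch below makes that explicit. It is a Boolean simplification under stated assumptions: rapalog is modeled as simply present or absent, although the reported response is dose dependent, and the function names are hypothetical.

```python
def split_signal_and_gate(antigens):
    """AND gate: CD3ζ signal from the PSCA CAR plus costimulation from the PSMA receptor."""
    cd3z_signal = "PSCA" in antigens     # first-generation CAR, CD3ζ only
    costim_signal = "PSMA" in antigens   # chimeric costimulatory receptor, CD28 + 4-1BB
    return cd3z_signal and costim_signal

def on_switch_car(antigens, rapalog_present, target="CD19"):
    """ON-switch CAR: antigen binding signals only once rapalog joins FKBP and FRB."""
    car_reconstituted = rapalog_present  # simplification: present/absent, not a dose
    return car_reconstituted and target in antigens

if __name__ == "__main__":
    print(split_signal_and_gate({"PSCA", "PSMA"}))   # both antigens present
    print(split_signal_and_gate({"PSCA"}))           # one antigen only
    print(on_switch_car({"CD19"}, rapalog_present=True))
    print(on_switch_car({"CD19"}, rapalog_present=False))
```

The difference between the two is where the second input comes from: the tumor itself in the split-signal design, and the clinician's dosing decision in the ON-switch design.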

Yet another example of synthetic receptor design repurposes the signaling 
mechanism of the Notch receptor. Antigen binding by the Notch receptor 
exposes a juxtamembrane cleavage sequence that undergoes proteolysis by the 
intramembrane protease gamma-secretase, a processing step that releases the 
intracellular Notch domain to the nucleus to serve as a transcription factor (TF) 
that drives gene expression programs. Utilizing a modular design approach anal- 
ogous to CAR engineering, researchers developed synthetic Notch (synNotch) 
receptors composed of an extracellular scFv domain fused to a synthetic TF via
the endogenous Notch transmembrane domain and juxtamembrane cleavage
sequence [72] (Figure 17.2f). Upon ligand binding, the synNotch receptor undergoes
cleavage and releases the synthetic TF to drive gene expression from a
cognate inducible promoter. By placing CAR expression under this transcriptional
control, a dual-receptor system enables T cells to perform AND-gate computation
in a sequential manner; that is, antigen A triggers the synNotch
receptor to drive expression of the CAR, and subsequent recognition of antigen
B by the CAR activates T-cell effector functions. Pairing a green fluorescent protein
(GFP) synNotch with a CD19 CAR enables T cells to effectively eliminate
tumor cells expressing both GFP and CD19, but not CD19 alone [73].
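
The sequential nature of the synNotch circuit can be sketched as a two-step check in which the CAR exists only after the priming antigen has been recognized. This is a deliberately simplified model: it assumes CAR expression requires synNotch engagement on the same target cell and ignores the kinetics of CAR induction and decay. The function and parameter names are illustrative.

```python
def synnotch_and_gate(antigens, priming_antigen="GFP", car_target="CD19"):
    """Sequential AND gate: synNotch priming must precede CAR-mediated activation."""
    # Step 1: synNotch binds the priming antigen, releasing the TF that induces the CAR.
    car_expressed = priming_antigen in antigens
    # Step 2: the induced CAR must then bind its own target to activate the T cell.
    return car_expressed and car_target in antigens

if __name__ == "__main__":
    for cell in ({"GFP", "CD19"}, {"CD19"}, {"GFP"}, set()):
        print(sorted(cell), "->", synnotch_and_gate(cell))
```

Unlike the split-signal AND gate, the second receptor here is absent until the first input is seen, so a CD19-only cell encounters a T cell that carries no CAR at all.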

Although these strategies underscore the vast potential of applied synthetic 
biology toward enhancing therapeutic efficacy and specificity, CAR performance 
is still subject to design rules that require better understanding. Multiple recent 
studies have implicated important CAR components, such as the framework 
region of scFv domains and the non-signaling extracellular spacer, in triggering 
tonic signaling [74, 75]. Moreover, in each of the examples highlighted previ- 
ously, multiple iterations of receptor design were required to identify the correct 
combination of “modular” components to achieve robust system performance. 
As more data become available through systematic studies of CAR design param- 
eters, a more quantitative, rational approach to next-generation CAR design will 
begin to supplant what has largely been a trial-and-error method in engineering 
CAR-T cells for disease treatment. 


17.3.2 Genetic Engineering to Enhance T-Cell Therapeutic Function 


Robust proliferation and persistence of T cells have been shown by multiple clin- 
ical trials to be both critical to therapeutic efficacy and difficult to achieve in vivo 
[32, 57, 58]. Consequently, there have been many attempts to prolong the sur- 
vival of CAR-expressing T cells via genetic engineering. These approaches can 
be broadly grouped into strategies that promote immune stimulation and those 
that counteract immune suppression. Within the former category, researchers 
have engineered “armored” T cells to overexpress immunostimulatory cytokines 
including IL-2, IL-12, and IL-15, thus sustaining T-cell proliferation and effector 
function [76, 77] (Figure 17.3a). Transgenic expression of costimulatory mole- 
cules such as 4-1BB ligand (4-1BBL) and CD40L or surface receptors including 
interleukin-7 receptor α (IL-7Rα), CCR4, and CXCR2 has also been shown to
mitigate T-cell exhaustion and promote T-cell persistence [78-82]. 

Even when armored with supportive cytokines and costimulatory signaling, 
engineered T cells can still become exhausted or rendered dysfunctional by 
repeated antigen stimulation or sustained exposure to immunosuppressive fac- 
tors located in the tumor microenvironment. To overcome this challenge, exten- 
sive research has focused on disrupting endogenous inhibitory signaling 
pathways (Figure 17.3b) or rewiring immunosuppressive inputs to immunostim- 
ulatory outputs (Figure 17.3c). Notably, clinical administration of monoclonal 
antibodies targeting inhibitory checkpoint molecules such as CTLA-4 and PD-1 
has been shown to alleviate immunosuppression of naturally tumor-infiltrating 
lymphocytes (TILs) and restore T-cell function in cancer patients [83-85]. While
the successes of these treatments have led to US FDA approval of antibodies such
as ipilimumab and pembrolizumab, checkpoint blockade therapies can only be
effective when tumor-reactive T-cell clones already exist in the patient’s system.
To address cancer types that are not naturally immunogenic, researchers have
engineered tumor-targeting T cells to express dominant-negative receptors
(DNRs) that compete with endogenous receptors for binding to tumor-associated
molecules such as transforming growth factor-beta (TGF-β) or PD-L1 (Figure 17.3b)
[86-90]. These genetically modified T cells can resist immunosuppression and
retain greater in vivo antitumor activity and are currently under clinical
evaluation (NCT00889954). The rapid progression of clustered regularly interspaced
short palindromic repeat (CRISPR)/Cas9 and mammalian genome editing
technologies has also enabled the corollary approach of ablating immunosuppressive
signaling by knocking out inhibitory receptor expression (Figure 17.3b). Indeed,
the first FDA-approved clinical trial involving CRISPR/Cas9-edited cells will
examine the use of T cells that have been genetically modified to knock out PD-1
expression [91, 92], and a similar trial has already begun accruing patients abroad
(NCT02793856).

Figure 17.3 Synthetic biological constructs and circuits enable controlled enhancement of
T-cell function. (a) “Armored” T cells are engineered to overexpress costimulatory ligands,
cytokine receptors, chemokine receptors, or immunostimulatory cytokines that can boost
T-cell proliferation, persistence, and effector functions in an autocrine manner. Once secreted,
immunostimulatory cytokines (e.g., IL-2, IL-12, IL-15, etc.) can also signal in paracrine fashion
to trigger the recruitment, growth, and antitumor responses of native immune cells. (b) T cells
can be genetically modified to resist inhibitory signals present on tumor cells (e.g., PD-L1) or
within the tumor microenvironment (e.g., TGF-β) by expressing dominant-negative receptors
(DNRs) or knocking out inhibitory receptors. DNRs lack signal transduction domains and
competitively sequester immunosuppressive ligands away from native inhibitory receptors.
Genetic knockout of inhibitory receptor expression abrogates receptor-mediated recognition
of immunosuppressive factors, thus reducing T-cell dysfunction and exhaustion. (c) Inverted
cytokine receptors (ICRs) are fusions between the extracellular ligand-binding domain of an
inhibitory receptor and the intracellular signaling domain of an immunostimulatory receptor.
Encounter with immunosuppressive cytokines (e.g., IL-4) in the tumor microenvironment activates
expression programs that enhance T-cell proliferation, persistence, and effector functions.

To further combat the effect of tumor-mediated immunosuppression, 
researchers have sought to actively invert suppressive cues to promote T-cell 
activation. One example of signal inversion was accomplished by fusing the 
extracellular ligand-binding domain of the inhibitory IL-4 receptor to the intra- 
cellular signaling domain of the immunostimulatory IL-7 receptor [93, 94] 
(Figure 17.3c). Expression of the resulting IL-4/IL-7 inverted cytokine receptor 
(ICR) reversed the suppression of PSCA-CAR T-cell activity in culture condi- 
tions mimicking the tumor milieu of pancreatic cancers [94]. Similarly, another 
study demonstrated that a chimeric receptor that converts PD-L1 binding to 
CD28 costimulation could elevate the effector functions of low-avidity T cells 
to levels observed in high-avidity T cells, suggesting a method to boost the ther- 
apeutic efficacy of T cells previously deemed unsuitable for adoptive T-cell 
therapy [95]. 


17.3.3 Generating Safer T-Cell Therapeutics with Synthetic Biology


Although extensive research has focused on improving the efficacy of cellular 
immunotherapy, safety remains the paramount priority in therapeutics develop- 
ment. Even precisely engineered cells retain the possibility of mutation after 
prolonged periods of expansion inside the host organism. Similarly, sustained
interference with the intricate balance between immunostimulatory and immu- 
nosuppressive signaling creates inherent risks for autoimmune dysfunction 
[96-98]. Safety concerns thus demand gene expression control systems that can 
be regulated by physician-administered drugs or by molecules specific to the 
tumor microenvironment. To address this challenge, several ligand-responsive 
regulatory systems have been developed to control the production of potent 
cytokines or suicide proteins by engineered T cells [99, 100]. In an early example 
of mammalian synthetic biology, small molecule-responsive ribozyme switches 
were inserted in the 3’ untranslated region of transgenes encoding IL-2 and 
IL-15, resulting in posttranscriptional control of cytokine production in a rapid, 
reversible manner in vitro as well as ligand-dependent modulation of T-cell pro- 
liferation in vivo [99]. Notably, the ribozyme switch is modularly composed with 
well-defined sensing, actuating, and information-transmission domains that can 
be independently modified for the specific application of interest. For example, 
RNA aptamers to a wide variety of ligands including nucleic acids, small mole- 
cules, and proteins have been generated in vitro [101], and ribozyme switches 
tailored for ligands specific to the disease of interest can be rationally designed 
by incorporating the appropriate RNA aptamers. In the context of cytokine 
regulation for T-cell therapy, ribozyme switches can be designed to respond to 
physician-administered drugs or to molecules known to be overexpressed by 
tumor cells, thus increasing the specificity and safety of this cell-based therapeu- 
tic strategy. 

As an alternative to the regulation of growth-promoting cytokines, the expres- 
sion of suicide genes that can rapidly and precisely eliminate engineered T cells 
provides a means to prevent runaway immune responses. The most commonly 
used suicide gene is the herpes simplex virus I-derived thymidine kinase 
(HSV-TK). Originally developed as a method to deplete donor T cells that cause 
graft-versus-host disease after allogeneic bone marrow transplantation, HSV-TK 
expression confers sensitivity toward the small molecule drug ganciclovir, thus 
enabling selective depletion of T cells that have been engineered to transgeni- 
cally express HSV-TK [102]. However, HSV-TK-mediated cell depletion is often 
incomplete, and the strategy precludes the use of ganciclovir as an antiviral drug 
for cytomegalovirus infections, a common and often fatal complication of bone 
marrow transplants [103]. Taking an alternative approach, researchers have 
developed chimeric suicide genes that fuse pro-apoptotic proteins with dimeri- 
zation domains to induce apoptosis upon the administration of a chemical 
ligand [104]. For example, an inducible caspase 9 suicide system has been con- 
structed by fusing an inactive pro-caspase 9 monomer to FKBP [105]. Upon 
administration of the chemical inducer of dimerization (CID) molecule AP1903, 
the FKBP domains homodimerize, resulting in the cross-linking and activation 
of the tethered caspase 9 domains, which in turn induce apoptosis in cells 
expressing this suicide system (Figure 17.4). The inducible caspase 9 system has 
been tested in an adoptive T-cell therapy trial and demonstrated the ability to 
eradicate >90% of engineered T cells within 30 min of AP1903 administration, 
effectively eliminating graft-versus-host disease symptoms without recurrence 
[100]. Combining several of the technologies summarized previously, researchers
have engineered second-generation CD19 CAR-T cells equipped with
constitutive IL-15 production and the inducible caspase 9 suicide system and
demonstrated superior in vivo T-cell expansion and antitumor effects compared
with T cells expressing the CAR alone [106].

Figure 17.4 A chemically inducible caspase 9 kill switch. Inactive pro-caspase 9 monomers
are linked to the human FK506 binding protein FKBP and constitutively expressed in the
engineered cell. Upon addition of the chemical inducer of dimerization AP1903, the FKBP
domains dimerize and lead to the cross-linking and activation of caspase 9, which triggers
downstream events in the apoptosis pathway and results in cell death.

Although suicide gene systems provide a powerful countermeasure to major 
adverse events such as deleterious genetic mutations in engineered cells, results 
from clinical trials have also highlighted situations in which measured dampen- 
ing of functions rather than complete elimination of therapeutic cells is the 
preferred response. In adoptive T-cell therapy for cancer, tumor regression is 
strongly associated with a dramatic increase in the level of inflammatory 
cytokines, a phenomenon known as cytokine storm or tumor lysis syndrome 
[107, 108]. When the intensity of the tumor lysis syndrome exceeds physiological 
tolerance, corticosteroids can be administered to the patient, thereby not only 
quelling the immediate dangers of therapy-associated toxicity but also terminat- 
ing the treatment by effectively disabling the therapeutic cell population [109]. 
As a potential alternative, researchers have engineered synthetic circuits that 
regulate the amplitude of T-cell activation, thus enabling fine-tuning of T-cell- 
mediated responses [110]. The bacterial protein OspF downregulates T-cell 
activation by inactivating the extracellular signal-regulated kinase (ERK). An 
“amplitude limiter” consisting of a negative feedback loop with OspF expressed 
from an NFAT promoter lowers the maximum level of T-cell activation-induced 
gene expression, which can be further modulated by the addition of degradation 
tags to the OspF protein. Furthermore, a “pause switch” was constructed by
expressing OspF from a doxycycline-inducible promoter, such that pulses of 
doxycycline addition result in temporary reductions in T-cell activation-induced 
expression [110]. 
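
The amplitude-limiter idea can be illustrated with a toy negative feedback model. The sketch below is not the model used in [110]; the two-variable equations, the parameter names (beta, delta, K), and their values are all assumptions chosen only to show the qualitative behavior: an inhibitor produced in proportion to activation lowers the maximum activation level, and faster inhibitor degradation (a stand-in for a degradation tag) weakens the limiter.

```python
def simulate(stimulus, beta=0.0, delta=1.0, K=0.5, dt=0.01, t_end=50.0):
    """Euler-integrate a toy two-variable model and return the final activation.

    All equations and parameters are illustrative assumptions:
      activation' = stimulus / (1 + inhibitor / K) - activation
      inhibitor'  = beta * activation - delta * inhibitor
    beta  : inhibitor production driven by activation (0 means no feedback)
    delta : inhibitor degradation rate (larger mimics a degradation tag)
    """
    activation, inhibitor = 0.0, 0.0
    for _ in range(int(t_end / dt)):
        d_act = stimulus / (1.0 + inhibitor / K) - activation
        d_inh = beta * activation - delta * inhibitor
        activation += dt * d_act
        inhibitor += dt * d_inh
    return activation


no_feedback = simulate(stimulus=1.0, beta=0.0)           # unregulated activation
limited = simulate(stimulus=1.0, beta=2.0, delta=1.0)    # feedback, stable inhibitor
tagged = simulate(stimulus=1.0, beta=2.0, delta=4.0)     # feedback, fast-degrading inhibitor

print(round(no_feedback, 3), round(tagged, 3), round(limited, 3))
```

In this sketch the steady-state activation is highest without feedback, lowest with a stable inhibitor, and intermediate when the inhibitor is degraded quickly, which mirrors the tunability described above only in a qualitative sense.
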

In addition to the magnitude of the immune response, the precision of effector
functions is critical to the development of safer cell-based immunotherapy. For 
example, there is an emerging consensus that the lack of tumor-exclusive, sur- 
face-bound antigens presents a fundamental challenge to the widespread imple- 
mentation of adoptive T-cell therapy [111, 112]. T cells identify target cells 
via surface receptor-mediated recognition of membrane-bound biomarkers.
However, tumor cells rarely express surface antigens that are completely absent 
in all healthy tissues. As a result, basal antigen expression by healthy tissues fre- 
quently elicits “on-target, off-tumor” toxicities in adoptive T-cell therapy. 
Bispecific CAR-T cells capable of AND- or AND-NOT-gate signal computation 
discussed previously represent one approach to increasing the precision of dis- 
ease-cell recognition based on surface interactions [69-73]. Another recently 
reported approach endows T cells with the ability to interrogate the intracellular 
environment of target cells through the delivery of intracellular antigen-responsive
cytotoxic switches derived from the cytotoxic protein granzyme B (GrB)
[113]. Expression of a small ubiquitin-like modifier (SUMO)-GrB fusion protein
selectively triggered cytotoxicity in cells overexpressing the intracellular tumor- 
associated sentrin-specific protease 1 (SENP1) [113]. Coupled with demonstra- 
tions of recombinant GrB transfer from T cells into target cells, these results 
point to a potentially viable strategy for improving the therapeutic precision of 
adoptive T-cell therapy by expanding the repertoire of targetable candidate anti- 
gens to include a plethora of intracellular disease signatures. Although these sys- 
tems remain to be validated in vivo, they highlight the versatility and malleability 
of cellular therapeutics, as well as the importance of effective engineering tech- 
niques in the development and optimization of cell-based immunotherapy. 
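
The recognition strategies discussed in this section can be summarized as simple Boolean rules. The sketch below is a deliberately simplified truth-table view, not a model of any published construct: the function names, the three hypothetical cell types, and the reduction of antigen levels to True/False are all assumptions made for illustration.

```python
def and_gate(antigen_a: bool, antigen_b: bool) -> bool:
    """Respond only when both surface antigens are present."""
    return antigen_a and antigen_b


def and_not_gate(tumor_antigen: bool, healthy_marker: bool) -> bool:
    """Respond to the tumor antigen unless a healthy-tissue marker is also present."""
    return tumor_antigen and not healthy_marker


def surface_then_intracellular(surface_antigen: bool, intracellular_marker_high: bool) -> bool:
    """Toy two-step check: surface recognition followed by an intracellular sensor."""
    return surface_antigen and intracellular_marker_high


# Hypothetical cell types: (surface antigen, healthy-tissue marker, intracellular marker high)
cells = {
    "tumor": (True, False, True),
    "healthy_with_basal_antigen": (True, True, False),
    "healthy_without_antigen": (False, True, False),
}

for name, (surface, healthy, intracellular) in cells.items():
    print(name,
          and_not_gate(surface, healthy),
          surface_then_intracellular(surface, intracellular))
```

Under these assumptions, a healthy cell with basal antigen expression would trigger a single-input receptor but is spared by either the AND-NOT rule or the additional intracellular check.
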


17.4 Challenges and Future Outlook 


The ability to efficiently design, construct, and optimize synthetic biological sys- 
tems that modify and/or interface with living cells is expanding new possibilities 
in the development of cellular therapeutics and offering enticing views of next- 
generation strategies for disease treatment. Early synthetic biological circuits 
consist of input/output devices linked in various configurations to achieve 
diverse purposes, including signal oscillation, memory, and cell-cell communi- 
cation [114-117]. If engineered to fit the clinical context and application-specific 
requirements, such functions could significantly improve the performance of 
cellular therapeutics. For example, a robust, tunable oscillation pattern would 
enable the regular, pulsatory delivery of drug molecules that are either synthe- 
sized or carried by therapeutic cells. The ability to memorize and keep count of 
events such as cell divisions would enable timed proliferation and death of engi- 
neered cells, providing an additional mechanism to ensure the safety of cellular 
therapies. The ability to sense extracellular molecular signals and communicate 
with other cells could enable time-, position-, and community-dependent res- 
ponses that serve as disease diagnostics or enhance the specificity of cellular 
therapeutics toward disease targets. In addition to synthetic circuitry that con- 
fers novel functions onto engineered cells, rapidly advancing genome editing 
technologies are poised to enable the generation of universal, off-the-shelf
cellular therapeutics that lack antigenic markers to induce immune rejection, a 
development that would significantly reduce the time and financial costs associ- 
ated with producing personalized supplies of therapeutic cells for each patient 
[118]. If efforts in whole-genome synthesis and the construction of artificial cells 
come to fruition, cellular therapeutics may eventually consist of fully synthetic 
cells with precisely controlled functions. 

Despite the myriad possibilities that synthetic biology inspires, real obstacles 
need to be overcome in moving from model systems to real-world applications 
in health and medicine. First, most synthetic biological systems demonstrated 
to date have been designed to function in microorganisms such as yeasts and 
bacteria rather than mammalian cells. Although some studies have shown 
transportability across organisms [99, 119], significantly more experience will 
be required in mammalian cell engineering to achieve the level of efficiency in 
system assembly, integration, and characterization that is now possible in 
microorganisms. 

Second, despite the variety of synthetic circuits that have been reported, a rela- 
tively small number of parts (e.g., the tet-inducible promoter, the theophylline 
aptamer, fluorescent protein outputs, or acyl-homoserine lactone (AHL)-based 
quorum sensing components) have been reused in a large number of designs, 
reflecting a need to expand the inventory of biological parts. In particular, 
cellular therapeutics development will require new outputs that execute thera- 
peutic functions at precisely defined activity levels [120], a significantly more 
complex task than ON/OFF control of fluorescent protein outputs. Similarly, 
new sensors need to be developed to respond to therapeutically relevant inputs 
such as disease-associated metabolites or FDA-approved drugs rather than oft-used
but clinically unacceptable inputs such as theophylline or isopropyl
β-D-1-thiogalactopyranoside (IPTG).

Third, given the paramount importance of safety in medical applications, any 
synthetic system applied to cellular therapeutics must perform with consistency 
and precision in the face of heterogeneities that are inevitable in the human body 
and particularly in diseased cells. Unlike model systems in which parameters 
such as input ligand concentration and cell density can be precisely controlled, 
clinical applications in which heterogeneous cell populations harvested from 
patients need to be quickly genetically modified, expanded, and reinfused in bulk 
into the patients require a high level of robustness such that the system would 
generate predictable and consistent outputs without the benefit of extensive cell- 
population refinement or well-defined ranges of input signal strength. In this 
regard, researchers are actively investigating genetic engineering strategies that 
can enable synthetic components to interface more robustly with host cell physi- 
ology. In contrast to random insertion of CAR transgenes via viral transduction, 
site-specific integration of a CD19 CAR into the T-cell receptor α chain (TRAC)
locus of primary human T cells resulted in antigen-stimulated regulation and 
greater uniformity of CAR expression [121, 122]. These site-specifically modi- 
fied T cells exhibited reduced tonic signaling, delayed T-cell exhaustion, and 
enhanced antitumor potency [121], underscoring the importance of being able 
to tune the expression level and signaling strength of synthetic systems. 

Finally, as synthetic biologists build increasingly complex systems, the twin
issues of scaling and implementation must be addressed. Current strategies in 
circuit design generally result in a roughly linear relationship between part num- 
ber/size and functionality. For example, an RNAi-based cancer-cell identifier has 
been demonstrated to distinguish HeLa cells from a number of other cancer cell 
lines by sensing the levels of six distinct microRNAs (miRNAs) through a net- 
work of constitutive and inducible promoters linked to genes encoding various 
inducer proteins and miRNA target sites [123]. This work provided an elegant 
example of a synthetic, multi-input system in mammalian cells applicable to cel- 
lular therapeutics engineering. However, adding each new miRNA input would 
require a significant increase in the footprint and complexity of the computation 
network without the benefit of economy of scale, leading to problems of parts 
shortage (i.e., there are limited numbers of inducible promoters available) and 
low integration efficiency (i.e., the system will eventually be too large to be deliv- 
ered and stably integrated in the cell). This challenge of scalability will have to be 
resolved before a broad range of cellular therapeutics with the necessary level of 
functional complexity can be developed through synthetic biology. 
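
The scaling argument can be made concrete with a toy multi-input classifier. The sketch below does not reproduce the circuit in [123]: the miRNA labels are hypothetical placeholders, the threshold is arbitrary, and the assumption that each added input costs a fixed number of new parts (parts_per_input) is only a stand-in for the roughly linear relationship described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiRNAInput:
    name: str          # hypothetical label, not a real miRNA identifier
    expect_high: bool  # True if the target cell type should show a high level


def is_target_cell(profile, inputs, threshold=0.5):
    """Return True only if every input matches its expected high/low state."""
    return all((profile[i.name] >= threshold) == i.expect_high for i in inputs)


def parts_needed(n_inputs, parts_per_input=3):
    """Toy linear part count; parts_per_input is an arbitrary assumption."""
    return n_inputs * parts_per_input


# Six hypothetical inputs: two expected high, four expected low in the target cell.
inputs = [MiRNAInput(f"miR-{c}", expect_high=(c in "AB")) for c in "ABCDEF"]
target = {"miR-A": 0.9, "miR-B": 0.8, "miR-C": 0.1,
          "miR-D": 0.0, "miR-E": 0.2, "miR-F": 0.1}
other = {**target, "miR-C": 0.7}  # one "low" marker is unexpectedly high

print(is_target_cell(target, inputs), is_target_cell(other, inputs))
print([parts_needed(n) for n in (2, 4, 6, 8)])
```

The classification logic itself stays a one-line conjunction as inputs are added, whereas the assumed part count grows in step with the number of inputs, which is the practical bottleneck the text identifies.
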

Cellular therapeutics has generated a tremendous amount of excitement in 
recent years, particularly in cancer immunotherapy. The cell engineering tech- 
niques and circuit design expertise being developed through synthetic biology 
are poised to make timely and significant contributions to the continuing 
improvement of cellular therapeutics. Important challenges remain to be 
addressed in biological system design and implementation methods, and the 
accumulating knowledge from ongoing efforts in synthetic biology will be criti- 
cal in the construction of synthetic biological systems with real-world applica- 
tions in health and medicine. 


Definitions


Cellular therapy: The use of living cells, as distinct from chemical
pharmaceuticals or biologics, as therapeutic agents in the treatment of diseases

Immunotherapy: Disease treatment that modulates the immune system to
enhance immune responses against disease agents or diseased cells

Adoptive T-cell therapy: A type of cellular immunotherapy in which autologous
T cells with specificity toward disease targets, including cancerous and
virally infected cells, are expanded ex vivo and reinfused into the patient.
T cells harvested from the patient may have endogenous disease-specific
reactivity, or they may be genetically modified to express disease-targeting
receptors prior to expansion and reinfusion

Chimeric antigen receptor (CAR): A class of membrane-bound fusion proteins
that redirect T-cell specificity toward specified target antigens. First-generation
CARs consist of four major domains: an extracellular single-chain variable
fragment (scFv) that determines target specificity, an extracellular spacer
typically derived from immunoglobulin molecules, a transmembrane domain, and
the cytoplasmic signaling domain of the CD3ζ chain that triggers T-cell
activation upon ligand binding to the scFv. Second- and third-generation CARs
contain one or two additional cytoplasmic costimulatory domains, respectively,
that enhance T-cell activation. The most commonly used costimulatory
domains are CD28 and 4-1BB

Suicide gene: A gene encoding a protein product that leads to cell death either
directly (e.g., by triggering the apoptosis pathway) or indirectly (e.g., by
processing prodrugs to lethal products through enzymatic reactions)


18.1 Introduction


This chapter summarizes recent developments in the field of public perception 
and public engagement and the attempt to apply the concept of “responsible 
research and innovation” (RRI) to synthetic biology (SB). Although the term
synthetic biology, in its contemporary sense, has been around for about a
decade, the field itself can be considered a continuation of genetic
engineering (GE), a research field established in the 1970s [1]. The term
synthetic biology itself was coined as early as 1910 by the French scientist
Stéphane Leduc. GE is defined as “the intentional manipulation of an organism’s genetic
material using tools that cut, move, and reattach (recombine) DNA segments 
within and across different organisms” [1]. SB is developed based on the experi- 
ence and knowledge of GE [2], yet tools and approaches of SB differ from those 
of GE as SB attempts to build more sophisticated biological systems [3]. Thus, SB 
can be seen as the second edition of GE, GE 2.0, as a “new way to organize and 
construct the art of genetic engineering” that “enforces the traditional engineer- 
ing concepts of modularity and standardization and adapts logical operator 
structures from information processing” [4]. Since its early days,
GE has faced considerable skepticism from different stakeholders, including
the research community itself, nongovernmental organizations (NGOs), and
regulatory bodies. In the early development of GE, the public and scientists
shared similar concerns about how to conduct the research. Along with the growth
and development of GE, oversight efforts have been developed to address these 
concerns at least since 1975 [5]. 

Although SB uses recombinant DNA techniques to engineer genetic circuits,
parts, devices, and whole systems, it differs from GE. In GE, the principal
approach is more like a “copy and paste” of naturally existing traits from
“donor” to “recipient.” As GE 2.0, however, SB gives scientists and engineers
more freedom to “compose” content based on design.

This would entail a deeper metabolic engineering [6], the definition of a mini- 
mal genome [7-9], the construction of protocells [10, 11], and the creation of 
noncanonical biochemistries [12-17]. SB is an interdisciplinary research field,
involving scientists across both science and engineering [4], while GE is known 
to be a discipline of life science. Another difference between SB and GE is the 
consideration of societal concerns at a very early stage of development. Generally
speaking, the world today, with the Internet, social media, and civil
society groups with diverse needs and concerns, presents a very
different Zeitgeist than that of the 1970s. But what does this mean for SB? Will
SB be just another reenactment of the GE debate from the 1980s and 1990s, or 
will the debate be carried out in a totally different way? 

In this chapter, the public perception of SB and the societal ramifications of its
applications will be reviewed. We will analyze how public perceptions of SB
have developed over the years. Then we will look at the contingencies that frame
the debate about the technology, with a special emphasis on comparator science
and engineering fields. Last but not least, in order to address some of the
concerns raised within the open dialogues on SB, the idea of carrying out RRI in SB
will be introduced.


18.2 Public Perception of the Nascent Field 
of Synthetic Biology 


According to many scientists and funding agencies, SB is believed to hold great 
potential for applications in multiple economic areas and thus may have signifi- 
cant ramifications for society. 

Learning from the history of GE, especially with regard to genetically modified 
(GM) crops in Europe, the opinions of the prospective end users and end 
consumers cannot be ignored in SB. Even the most techno-optimistic engineer 
realizes that he is not working in a societal vacuum but is part of a societal fabric 
that relates not only to public funding decisions but also to the way the research 
is done in the lab. In the cases of GM crops and the human genome project, the
technology was developed first and its implications for society were
discussed rather “downstream.” SB, by contrast, demonstrates a new paradigm in
which societal issues are placed more “upstream.” The idea is that strong concerns
and objections would appear on the radar early on, so that “appropriate” measures
could be taken to deal with them, instead of fully developing a technology in
total ignorance of society’s reaction, risking the burial of a whole suite
of technologies, and wasting millions in taxpayers’ R&D money (as in the case of
GM crops in Europe). What “appropriate” means in this context is another
important aspect, which will be discussed in Section 18.4.

One way to find out what “the public thinks” about SB (or any other new and 
emerging technology) is by conducting public perception surveys. These can take
the form of phone or face-to-face interviews or written questionnaires (sometimes
complemented by focus groups, where about a dozen people have a moderated
discussion). The advantage of this approach is the relative ease of obtaining
some first data. The downside, however, is that the results can only be regarded
as a rough snapshot of a moment in time, and a deeper understanding of the
rationale behind those perceptions is not always possible.


Although SB is still a rather young field, a number of projects and social 
science research groups have carried out studies on public perception in Europe 
and the United States. These investigations have been conducted with different 
methods, ranging from phone surveys and focus group studies to public 
dialogues, and the sample sizes of these studies also varied. It is worth
pointing out that quantitative comparisons across these data are therefore
difficult. What follows is thus merely a summary of these findings on public
perception of SB.


18.2.1 Perception of Synthetic Biology in the United States 


Hart Research Associates conducted three consecutive annual surveys from 2008
to 2010 and a further survey in 2013 [18]. Their findings on public perceptions
of SB are summarized below.


Awareness of the technology: Public awareness of SB increased steadily over the
first three surveys. The share of respondents who had heard a lot or some about
the technology rose from just under 1 in 10 to roughly 1 in 4 (9% in 2008, 22%
in 2009, 26% in 2010, and 23% in 2013). This trend might reflect developments
in the research field. The promise of SB in converting
biomass into useful products [6, 19-21] and the creation of the first synthetic
cell made the headlines of mainstream media [22]. Public exposure to
newspaper articles and other media increased especially in 2010, the year
the Venter Institute published its so-called synthetic cell. In more
recent years, with no comparably “thrilling” news coming out of SB, public
awareness was slightly lower.

Imaging the technology: Results from the 2013 survey showed that nearly
one third (31%) of respondents associated the science with something
unnatural, man-made, and artificial, and 15% linked it to generating new
life via genetic manipulation. Other respondents pointed to possible
applications in medical science (10% for prosthetics and 6% for new medicine),
agriculture (6%), and basic science (5%). The association between SB and
something man-made and artificial might result from the term “synthetic,” which
is traditionally linked to man-made chemicals. It might also reflect the channels
through which the public learned about the technology, for example, media
coverage of synthetic cells.

Risk and benefits of the technology: Uncertainty in judging the risk-benefit
balance was highest among respondents who had heard nothing about the
technology (49% vs 23% of those who had heard a little and 18% of those
who had heard a lot or some). Among those who had heard about the technology
(a lot or some, or a little), the largest group considered the risks and
benefits to be about equal (37% and 40%, respectively). Among those who had
heard a lot or some, more respondents judged that benefits outweigh risks (28%)
than that risks outweigh benefits (17%). After respondents were given
information about SB, uncertainty fell (from 27% to 5%), yet the share with a
negative view rose from 15% to 33%. This suggests that the public forms its
judgment of a new technology at least partly on the basis of the information it
receives. Given only limited information (a short description without further
supporting material), the public tended to be more skeptical toward the newcomer.
This finding runs against the belief, often voiced by SB scientists, that
the more the public knows about the technology, the more they like it. The
surveys show that this is not automatically the case.


Oversight of the technology: A comparison of the 2013 and 2010 survey results
shows a shift in public opinion on how SB should be regulated. More people
considered voluntary research guidelines adequate: 43% in 2013, up from 36% in
2010. A detailed breakdown of the 2013 opinions showed that those who favored
government regulation had more confidence in the federal government to maximize
benefits and minimize risks (59%), while only 33% of those who favored
voluntary guidelines expressed such confidence. Despite the lack of consensus
on how SB should be regulated, the majority of people (two thirds) supported
SB research rather than banning it because of the lack of information on
risks (one third). This attitude remained the same in the latest two surveys
(2010 and 2013). Support for the technology to go ahead was strongest (88%)
among people who held that benefits outweighed risks, while
those who believed that risks outweighed benefits were keen to ban the research
(61%). Regarding the most problematic issues, the ranking was as follows:
potential to create biological weapons (28%), moral concern to create artificial
life (27%), harm to human health (20%), and damage to the environment (12%).
An interesting finding from the latest survey was that awareness of the
do-it-yourself biology (DIYBio) movement among the public was very low
(only 7%), although this grassroots movement is meant to encourage
public engagement in research through so-called citizen scientists.


Other studies: The public attitudes toward SB reported in these surveys have
been analyzed further [23-25]. Pauwels summarized
two clear findings from the SB surveys [24]. The first is that most people
know little or nothing about SB. Second, notwithstanding this lack of
knowledge, respondents are likely to venture some remark about what they think SB
is and about the trade-off between potential benefits and potential risks. This is
common in public perception of other technologies as well, owing to limited
science literacy. Analogies to cloning, GE, and stem cell research recurred in the
communication of SB in scientific publications and public outreach
materials. More frames and comparators of SB will be reviewed in Section 18.3. The
potential applications seem to be another decisive factor in shifting public
perception of SB. Finally, the acceptance of the risk-benefit trade-off of SB
seems to depend on an oversight structure that would manage the unknowns,
the human and environmental concerns, and their long-term effects. The analysis
indicated that additional investigations are needed to identify other factors
that would shape public perceptions about SB, its potential benefits, and its
potential risks. A comparison of the US surveys with the UK public dialogue
found that awareness of SB grew significantly in the United States,
while the UK dialogue indicated “conditional support” for SB [26]. The
development of public perception of SB was also studied by comparing the trends
in press coverage of SB in the United States to those in parts of Europe [25]. It
showed that news stories in the United States mentioned potential
benefits of SB (51%, news coverage from 2003 to 2008) more often than potential
risks (44%), while the European press mentioned risks (59%, from
2003 to 2007) more often than potential benefits (28%).


18.2.2 Perception of Synthetic Biology in Europe 


18.2.2.1 European Union 

The public attitudes toward biotech and the life sciences in Europe have been
assessed by the Eurobarometer surveys. The most recent Eurobarometer on this
topic was conducted in 2010, based on representative samples from 32 European
countries [27, 28]. Analysis of the survey showed that people in Europe were
largely unaware of SB: only 17% of those who participated in the survey had
heard of it. Regarding GE in general, there were concerns about products of GE
technology, particularly food [28, 29]. Among these concerns were the common
perception that GE food was probably unsafe or even harmful and a concern about
safety due to possible horizontal gene transfer. However, public attitudes
toward novel technology were not entirely negative. The survey showed that the
public believed that research on biofuels (an application developed by SB
approaches) should be supported. The primary concern regarding SB was
information about the possible risks (63%). A majority of the public would also
like to know more about the claimed benefits (52%). Other concerns were who
would benefit and who would bear the risks (40%), scientific progress in the
field (31%), regulation (29%), funding (24%), and societal issues (16%) [28].
Because of their unfamiliarity with the technology and their greater enthusiasm
for the novel field, the public considered that the regulation of SB should be
science based (left to the scientific experts) but with the necessary oversight
from the authorities; where ethics and social values were involved, however,
the public should be included in decision-making [29]. When asked how SB should
be regulated, more than half preferred scientific evidence (52%) over moral or
ethical issues (34%) as the basis. The public also preferred that decisions
about SB rest on expert advice (59%) rather than on what the majority (lay
people) would think (29%). A majority (77%) agreed that SB should be tightly
regulated by the government [28]. In 2014-2015, the expert committees of the
European Commission (EC) conducted three public consultations related to SB,
covering the definition of SB, risk assessment methods and safety aspects, and
SB-related risks to the environment and biodiversity and research priorities
[30-32]. The reports from these public consultations, although the opinions
most likely came from closely related stakeholders in the field, paved the way
for further dissemination of SB in Europe.


18.2.2.2 Austria 

A study on communicating SB from scientists via the media to the public was
conducted by the Austrian COSY (Communicating Synthetic Biology) project in
2008 [33]. Press releases written by the scientists about their work were
reviewed by four journalists from major Austrian newspapers and magazines. The
journalists then wrote articles based on these materials, which served as
topics for discussion in eight focus groups with members of the Austrian
public. The study yielded two important observations about science
communication from scientists via the media to the general public. First,
journalists focused on and selected real-world applications of SB
(pharmaceuticals, biofuels, etc.) rather than the abstract key scientific and
engineering concepts (such as standardization and modularization). As a result,
the very aspects that distinguish SB from GE were not conveyed properly to the
public via the media (by the journalists, in this case), and thus the
laypeople could not see any difference between GE and SB, believing that SB was
just another name for GE. The second observation concerned the relation between
information and attitude. Before the participants read the articles, their
opinion toward SB was neutral, neither positive nor negative. After reading the
articles, and partly due to the link made to GE, two groups became very
negative and two groups became very positive toward SB, whereas four groups
still had a rather neutral or uninterested opinion about SB (see Figure 18.1).
So the assumption that “the more they know, the more they like it” cannot be
established on the basis of these empirical results. Rather, attitudes toward
SB were likely to polarize once more information was provided. The information
in the focus groups was not taken in neutrally; rather, the social identity of
the individual influenced the revision of the earlier attitude. This might
explain the polarization of attitudes toward SB among the Austrian laypeople.

A media analysis of SB was carried out on German-language media articles
published between 2004 and 2008 [35]. It showed that media reports about SB
focused more on the positive potential and less on the risks. The definition of
SB was introduced to the public along with possible applications. Journalists
used common metaphors to define SB such as “biological engineer,” “playing
Lego,” or “redesign of life,” while the common phrases were related to the
terms “machines,”

Figure 18.1 Focus groups’ evaluation of SB before and after receiving
information. The x-axis ranges from -100 (totally opposed) through 0 (neutral)
to +100 (totally endorsing). Source: “Consequences of media information uptake
and deliberation: focus groups’ symbolic coping with synthetic biology” [34].

showed again how the media communicate an emerging technology: by preferring
common metaphors and omitting information on the key scientific issues, they
might fail to convey the technology properly to the public.

A rather unusual attempt to understand how lay people would react to SB was
made in relation to one of the first SB art exhibitions. The exhibition
“synth-ethic” showed 10 artworks from 10 international bio-artists in Vienna in
May and June 2011 [36]. The artworks dealt with various aspects of SB: the use
of biobricks not attempted by engineers, creation of protocells, modifying
ecological networks, potential environmental release, the meaning of synthesis
as opposed to analysis, etc. (see http://www.biofaction.com/synth-ethic/).
During the exhibition, gallery visitors were interviewed about their perception
of the artwork and the relation between art, science, and society. The results
showed that people had few ethical problems with the artwork as long as it
entailed the use and modification of lower life forms (bacteria, plants), but
they were more concerned as the bio-objects moved up the evolutionary ladder
toward birds, mammals, and even humans. An innate key concept seems to be the
need to keep different categories separate from each other, and any crossing of
boundaries triggered uneasiness. Boundaries could be crossed in an ethical
sense by modifying and designing mammals and humans, by crossing two different
living entities (hybrids), or by crossing organisms and machines [37]. Any
attempt to cross the well-established boundaries of lay people’s naive view of
biology could result in public resistance.


18.2.2.3 Germany

Three German research organizations, the German Research Foundation (DFG), the
German Academy of Science and Engineering (acatech), and the German National
Academy of Sciences (Leopoldina), published a position article outlining their
strategies on SB and suggesting a broadly based scientific and public debate on
SB [38]. It suggested that SB would make major contributions to society while
bearing risks in areas such as legal aspects, biosafety and biosecurity,
commercial use, and ethics. While German scientists are active in the research
field [39], the German public, like that of its smaller neighbor Austria, holds
skeptical views on GMOs. It was feared that crossing “the boundaries between
living matter and technically constructed matter” would cause public concern
and that ethical boundaries would be broken down as well. In their role as
funding agencies in Germany, the three organizations proposed that the
activities supported by public funding should “guarantee transparency by means
of communication that will foster public acceptance of this research field.”
The ethical issues would need to be debated further by the public on the basis
of the following seven hypotheses and goals [38]:


1. The definition of life.
2. The factors that determine the preconceived understanding of life.
3. The description of entities.
4. Moral arguments on applications of SB taking into consideration basic rights.
5. Fundamental ethical objections against the applications of SB.
6. The debate on self-regulation of science.
7. All the discussions would be on “a comprehensive, interdisciplinary and
intercontextual” basis.


18.2.2.4 Netherlands 

The societal issues around SB were studied by the Netherlands Commission on
Genetic Modification (COGEM). Its report, published in 2010, analyzed the
developments in SB and sought to answer the questions of when and how
governments would have to anticipate the public debate on SB in order to
prepare for future developments in the field [40]. In the Netherlands, the
emergence of SB was subject to a public debate, which seemingly reiterated the
old debates on biotechnology.

New technologies that raise high expectations always stir considerable
controversy, and SB is no exception. It was believed that there was a gap
between the technical expectations and the reality, as no concrete SB
applications had yet reached the market. One problem identified was that while
hype dominated media reporting, little information about specific societal
implications was available, and later, when this information became available,
the topic had disappeared from the media and the public debate.

The COGEM report thus concluded that “technology assessment needs to
facilitate the societal-ethical debate when media attention, and thus the
visibility of the technological developments, declines.” It described the
situation in which public debates on SB were being conducted: scientists
speculated in the media about future developments in SB, and the media played
“host to an exchange of ‘dream’ and ‘doom’ scenarios.” It suggested that the gap
between available information and hype-based media attention should be closed
by using technology assessment to facilitate public debate. For example, a
technology assessment was done on SB, pointing out new dimensions to old
questions in public debates [41]. The issues identified in comparing GM and SB
were biosafety, misuse/bioterrorism, intellectual property, and ethics. The
challenges that SB raised were new questions and uncertainties about risks,
difficulties in monitoring misuse and research on potentially harmful
organisms, new hurdles for research and innovation, and a blurring boundary
between life and machines. These issues and challenges should be primary topics
for the public debates. Meanwhile, different technology/policy processes should
be used at different stages of societal-ethical discussions. At the early
stage, the public debate would be initiated through the expectations
articulated by the scientists. The introduction of the emerging technology
(mostly promises and expectations) prompted the general public to form a
perception. The real developments in the field (breakthroughs or failures)
would prompt the public to revise what they were first told. During the growing
stage of a technology, it was the achievements in the field that led to public
debates. While concrete applications were absent, it was not easy for the
government to address the possible issues in advance and steer developments
accordingly. The public debates at this stage should have clear goals, with the
objective of steering the direction of the technological development, gauging
public support for the development, or providing input to shape or support
policy.


18.2.2.5 United Kingdom 

In the United Kingdom, public perceptions of SB were studied by the Royal
Academy of Engineering (RAE) and the Biotechnology and Biological Sciences
Research Council (BBSRC) with input from the Engineering and Physical Sciences
Research Council (EPSRC) in 2009. The report from RAE was based on a dialogue
activity with 16 members of the public and a nationwide representative survey
of 1000 adults aged 18 and over [42]. The perceptions of laypeople about the
scientific research, and their awareness and understanding of SB, were
investigated. The report showed that in the United Kingdom awareness of SB was
low, similar to the findings in the United States and the rest of Europe:
nearly two thirds had never heard of SB, and of the one third who had, only 10%
had heard a lot (3% of all survey respondents), 57% a little (19% in total),
and 33% only the term (10% in total). The words linked to SB were “artificial,”
“unnatural,” and “man-made.” When asked about creating and modifying life and
about totally man-made organisms, the majority of respondents were positive
about creating microorganisms to produce medicines and biofuels, a finding also
reported by the Eurobarometer. Regarding issues around SB, the survey showed
that there were biosafety concerns about SB applications involving
environmental release, while there was comparatively little concern about the
biosecurity issues that SB might bring.

The BBSRC and EPSRC published a report outlining the most important findings
of the Synthetic Biology Public Dialogue [43]. This dialogue was conducted by
TNS-BMRB and involved 41 stakeholder interviews as well as 160 members of the
public and specialists on science and governance. In the United Kingdom,
expectations are high that SB could address some challenges facing society as a
whole. Yet how to foster such a science should take the social context into
account. The dialogue was conducted among interested groups from the public,
people from the research community, and other stakeholders to explore public
expectations, concerns, and aspirations around SB. The major findings were as
follows: people were both excited and scared by the potential of SB; they were
concerned about adequate regulation and preferred international regulation of
SB, particularly for those applications that (might) affect the environment;
and the public was concerned about the motivations of scientists, who were
asked to consider the wider impacts of their work. The UK dialogue revealed the
important role of public debate on SB and showed its impact on dissemination,
on awareness of the issues raised in the dialogue, and on the need for public
engagement.

The UK dialogue also showed that different stakeholders held different views on
SB. For example, researchers from academia tended to “rebrand” their research
as SB to attract funding, while researchers from industry tended to avoid the
SB label because of the negative perception of “synthetic” among the lay
public. The social scientists, NGOs, and consumer groups held that the
development of SB was driven by the interests of large corporations [44].
However, these different views did not prevent the stakeholders from agreeing
on the value of public engagement. A dialogue engaging all the stakeholders
will provide communication channels for building responsible research in SB,
which we will review in Section 18.4.


18.2.3 Opinions from Concerned Civil Society Groups


Attention to SB is not limited to the academic community (social science,
science and technology studies, technology assessment, etc.) and regulatory
bodies. Concerned groups, especially environmental NGOs [45] such as the Action
Group on Erosion, Technology and Concentration (ETC), have conducted several
studies on SB and GE since 2006 [46-49]. In 2006, the ETC Group and other NGOs
published an open letter calling for a societal debate on the socioeconomic,
security, health, environmental, and human rights implications of SB [50]. In
one of their reports, they argued that the advocates of SB intended to “avoid
public scrutiny by asserting that it is impossible to clearly distinguish
their work from earlier advances in recombinant DNA technology (genetic
engineering)” [47]. They recommended that public dialogue should be encouraged
and the potential risks should be made transparent. Although promoting SB could
contribute to “the green economy,” they argued that “a full global public
debate on all of the socioeconomic, environmental and ethical issues related to
biomass use, synthetic biology, and the governance of new and emerging
technologies in general” was needed [51]. ETC, together with other NGOs such as
Friends of the Earth U.S. and the International Center for Technology
Assessment (ICTA), published suggested principles for the oversight of SB.
According to them, “full public participation at every level” should be
included in the oversight of SB, and “full disclosure to the public of the
nature of the synthetic organism” should be a prerequisite for the
commercialization or environmental release of any SB product (ETC et al. 2012).
In October 2012, ETC together with Friends of the Earth managed to get their
concerns heard at the COP 11 UN meeting on the Convention on Biological
Diversity (CBD) in Hyderabad, India. With 193 nation states represented, the
representative of the Philippines asked for a moratorium on SB (initially
suggested by the NGOs), which was then rejected by the other states. In
response to the critical views, a final statement by all nations followed,
asking for a cautious approach to SB. The opinions from the concerned groups
show the need not only for public engagement but also for open access to the
technology. These issues are key to the framework of RRI, which will be
reviewed in Section 18.4.


18.3 Frames and Comparators


As we have shown in Section 18.2, comparisons between SB and GE are widely
used when scientists communicate with their peers and with the public. Thus SB
can be seen as GE 2.0. There are, however, strong indications that SB, in
science and in the public debate, goes beyond a mere continuation of GE. Such
debates are subject to dominant frames, because otherwise it would not be
possible to discuss anything [52, 53]. For a debate to develop, it is necessary
to establish a common understanding of what is to be considered relevant and
which form of argumentation is deemed legitimate. Without such an
understanding, a debate is dysfunctional, and potential participants are unable
to have a discussion in the first place [54, 55]. Participants need to identify
a common frame under which the debate can be held [56].

The choice of a dominant frame does not determine the fate of a debate. But 
it has implications for the selection of relevant expertise, of the kind of stake- 
holders to be invited, of the type of measures to be taken, etc. For example, the 
debate about green biotechnology in Europe was mostly held under a risk frame, 
that is, arguments about risk for human health and the environment were 
deemed more relevant than economic equity or ethical concerns. Consequently, 
scientists were asked about the probability of risks, and prior risk assessment 
was made mandatory. In the stem cell debate, an ethics frame prevailed, and 
arguments over the sanctity of embryonic life were considered more important 
than health risks. The expertise taken on board in the negotiations included 
those of ethicists and clergymen, and measures included a ban on some forms of 
research. Yet another frame, different from risk and ethics, is the economic
frame, which emphasizes future benefits, growth, and opportunities for the
economy.

In principle, other frames are conceivable. Empirically, however, these three
are the most frequent in technology debates: media analyses of technology
controversies revealed “basic frames” that are not fundamentally different
[57].

For an upstream debate on emerging technologies such as SB, dominant frames
do not readily emerge from the issue itself, as this issue is still vague in
its properties and consequences. Analogies to other technologies that have left
a mark on the public’s imagination come in handy here. The frames of the past
debate on
the older comparator technology influence those developing in the debate over 
the new technology. In practice, frames are often “copied” from a comparator 
debate and “pasted” into the new one: dominant arguments and the choice of 
issues relevant in the debate over the older technology serve as a blueprint for 
debating the implications of the new technology [56]. We might as well call it a 
“recombinant debate.” 

Many observers have assumed that SB would follow the same development as GE in
the 1980s and 1990s, hence the coinage GE 2.0. SB, however, as a truly
interdisciplinary and converging technology, has been linked not only to
biotechnology but also to nanotechnology and information technology (IT)
[56, 58]. Each comparator conveys different aspects, expectations, hopes, and
fears, and the dominant debates are held under partly or entirely different
frames. Each comparator entails a unique way to understand and interpret the
technology at stake. For biotechnology, the comparator stands for “technology
as conflict”; in the case of nanotechnology, it is “technology as progress”;
and for IT it is “technology as gadget.” The terms “conflict,” “progress,” and
“gadget” are used here only to capture the main meaning of each frame in a
single term (see Figure 18.2). Thus, “If a comparator becomes dominant, i.e.
obvious to many experts, stakeholders and members of the public it might
influence the course of a debate ‘out there’ through suggesting one or more
dominant frames. They will reflect the encompassing nature of the debate
through their implicit conceptualization of the public: ‘technology as
conflict’ goes along with the public to be taken seriously; seen through the
glasses of ‘technology as progress,’ the public appears as an entity to be
mastered through appropriate means; and with ‘technology as gadget’ the public
is seen as a player in the technology’s own team, so to say” [56].

Figure 18.2 The dominant comparator for SB could come from any of three
preexisting technology debates. Diagram labels: synthetic biology; genetic
engineering (technology as conflict); nanotechnology (technology as progress);
information technology (technology as gadget).


18.3.1 Genetic Engineering: Technology as Conflict 


GM crops or “green” biotechnology have been subject to adverse public
perception in some countries. It is therefore not surprising that critical NGOs
refer to GE as a comparator for SB [47], painting a dark picture in which SB is
absorbed into the agenda of big business to exploit natural resources even more
aggressively. The ETC Group dubbed SB “extreme genetic engineering” and
underlined the risks and inherent conflicts, such as intellectual property
rights (IPR), economic and power concentrations, environmental safety, and
rural livelihoods. As a general rule, the environmental NGOs that have
addressed SB so far have tended to extrapolate arguments against various forms
of biotechnology to future applications of SB (ETC et al. 2012).

Policy makers refer to the GE comparator mostly as a menace: “the same” as with
GM food (i.e., a failed implementation due to public rejection) must be
prevented. The IRGC report of 2010 (p. 37) described this reaction common
among experts and policy makers as “... the ‘fear of the fear of the public’ — a 
concern among those working on synthetic biology that the kind of public 
response to GM crops that emerged in Europe in the late 1990s would be trans- 
ferred, perhaps in a more virulent form, to synthetic biology.” The problem, 
accordingly, lies in how to “... find ways of reconciling fundamentally
conflicting values or ideologies.” When “... there are strong differences of
opinion at the outset of a debate, it is hard to manage the process in such a
way as to avoid further polarization of views and exacerbation of conflict,”
exactly as with GM food [59].

In their communication strategies, many scientists and industry representatives
try to emphasize the difference from conventional biotechnology/GE. This may be
related to presenting a promising new field to funding agencies. On the other
hand, it has been pointed out that SB is an extension of GE, transgressing past
approaches but proceeding along the same avenue toward artificialness. In their
approach to the public, allusions to the biotechnology conflict in Europe can
be found, although many prominent scientists come from the United States, where
biotechnology has not met with particular problems among the public. In Europe,
in contrast to fears among policy makers, SB has not met with strong objections
so far. One reason might be that it has not impinged on food, and food issues
have tended to be major conflict triggers, and not only with regard to GE.


18.3.2 Nanotechnology: Technology as Progress 


Nanotechnology is an emerging technology par excellence, bearing high
expectations and benefiting from massive public funding; the EC alone, for
example, spent €3.5 billion through the 7th Framework Programme (FP7).
Regarding potential risk, both allegations and serious concerns have been
addressed more professionally than with biotechnology in its early days.
Assessments mostly resulted in identifying far-reaching knowledge gaps to be
filled incrementally but rapidly. In contrast to the perception of some
technical experts and policy makers, press coverage has not particularly
focused on risk so far; rather, the potential for huge benefits has mostly been
to the fore [60]. Despite many speculations that nanotechnology might elicit
concerns similar to GMOs (and occasional demonstrations limited mainly to
France), it succeeded in evading the public rejection trap.

To address some negative speculations on nanotechnology, a variety of public
engagement exercises have been set up (see e.g., [61]). Apart from more
academic social science research, information initiatives such as the
“nanoTruck” in Germany, science fairs, and similar upstream outreach activities,
as well as a number of participatory events of different forms, belong to a new
way of successfully introducing a novel technology “in a responsible way.” Among
other outcomes, this focus helped coin the term “responsible research and
innovation,” to which the EC has also subscribed for other technological areas
[62] and which will be further discussed in the next section.


18.3.3 Information Technology: Technology as Gadget 


IT or computer technology has changed our lives over the last decades in an
unprecedented way. Few technologies have had a similar impact on modern
society. Computers govern virtually every aspect of our modern existence and
have caused an explosion in productivity. Initial resentment was overcome
quickly, and IT has developed into a synonym for the most powerful, pervasive,
and, at the same time, “cool” technology imaginable. Gadgets and toys galore
have contributed to this image, and possessing the newest product has become
the most relevant status symbol. There is a critical debate on aspects such as
intellectual property, privacy, or cybercrime, to name but a few, yet the
technology as such is established beyond any question.

SB can also be considered an IT, only one using a different medium, namely, DNA
base sequences rather than software code. Protagonists stress the IT analogy to
a remarkable extent, and many pertinent examples and apparent similarities
between SB and IT appear in the literature. The analogies mostly refer to
elements of the technologies themselves even if they derive from entirely
different disciplines such as electronic engineering.

The closest link between SB and IT is established through the scientists and
engineers involved: many of the original protagonists in SB come from the IT
sector. As part of their professional world view, they frequently allude to IT
construction elements such as integrated circuits, devices, and systems when
talking about biological entities such as genes, biological pathways, cells,
and organisms. In addition, they decidedly set out to apply engineering
principles in biology, which is also the most frequently used definition of SB.
Even the formation of amateur biology or DIYBio groups comes from a hacker
tradition seen in the IT world.

Using the IT frame as a dominant guide for assessing and debating the
ramifications of SB results, to a great extent, in a predominantly positive,
cool, and gadget-like perception of SB. Yet it also calls for addressing safety
and security concerns as well as intellectual property issues similar to those
of IT. Thus, responsibility in SB research should be fostered alongside the
development of the technology, which will be discussed in the following
section.


18.3.4 SB: Which Debate to Come?


Since SB is still largely unknown to large parts of the public and contemporary
debates are held mainly among experts, it is hard to tell which way the SB
debate is going to play out. Will it develop along the lines of the old GE
debate, to which many environmental NGOs link it? Or will the nanotechnology or
IT comparator frame the debates? We are not aware of any hard facts that would
determine the future debate about SB. In the absence of such hard facts, some
scholars have investigated artistic expression as a sense of possibility.

A kind of sneak preview of the debate to come was presented by a study of
independent SB short films [63]. The authors analyzed (semi-)fictional short
films about SB that were shown during the Science, Art and Film Festival
BIO-FICTION (see www.bio-fiction.com/videos). At this festival, filmmakers
used short films to present their visions of how SB would be taken up by
society. Since artists can to some extent be regarded as cultural
psychologists, the depiction of SB in these science fiction/documentary films
might help us to grasp the first hints of an SB debate to come. Going
through the 52 short films from BIO-FICTION, the authors came to the
conclusion that “representations of SB in
the Bio:fiction films confirm with our hypothesis that the debate about SB is not
seen as a straight continuation of the debate in biotechnology/genetic
engineering. Instead, alternative narrative attractors seem to be dominant.
Although we were not able to make a clear case for either technology as progress
or technology as gadget, since both aspects played out more or less equally, we
could clearly reject the technology as conflict frame” [63].

An analysis of the three main comparators of SB shows that SB goes beyond GE
2.0, as indicated both by the scientific/technological stance and by the early
indications of public debates. Facilitating the development of SB and unleashing
its full potential calls for a new framework for research: a framework for RRI.


18.4 Toward Responsible Research and Innovation 
(RRI) in Synthetic Biology 


Implementing the RRI approach in SB can help to address the societal needs
and challenges of the emerging technology. In the roadmap for SB in the United
Kingdom, continuing RRI was identified as one of five core themes for
achieving a successful outcome of SB in the country [64].

RRI has been defined by the EC as “the comprehensive approach of proceeding 
in research and innovation in ways that allow all stakeholders that are involved in 
the processes of research and innovation at an early stage (A) to obtain relevant 
knowledge on the consequences of the outcomes of their actions and on the 
range of options open to them and (B) to effectively evaluate both outcomes and 
options in terms of societal needs and moral values and (C) to use these consid- 
erations (under A and B) as functional requirements for design and development 
of new research, products and services” [65]. 

The RRI approach should also “be established as a collective, inclusive and
system-wide approach.” RRI is considered “a key pillar in the strategy of
the European Union (EU) to create sustainable, inclusive growth and prosperity
and address the societal challenges of Europe and the world” [65]. Its
objective is to address “the ethical concerns and societal needs in research and
innovation,” which can contribute to anchoring research and innovation (the
normative dimension), help to deliver the targets set out in the Europe 2020
strategy (the substantive dimension), and help to improve research
administration (the instruction dimension). In 2012, the EC issued a call for an
action plan on societal challenges, and one of the special challenges it
targeted was RRI in SB. The call reasoned that although SB held many significant
promises for society, the public was not yet much aware of this nascent field
and the associated regulatory challenges. Thus, “it is essential to establish
open dialogue between stakeholders, to understand public concerns and ensure
collaborative shaping of the field, aligned with societal needs and
expectations” [66]. A dedicated project on RRI in SB has been funded under FP7
to establish an open dialogue between stakeholders concerning the potential
benefits and risks of SB and to explore the possibilities for its collaborative
shaping on the basis of public participation [67].

It is believed that RRI should be practiced continuously, which will help to
ensure awareness of potential issues and keep the regulatory frameworks up
to date with progress in the field. SB is a nascent research field and RRI is a
relatively new concept. Building an RRI framework for SB will no doubt bring
both challenges and opportunities. The RRI concept has been promoted by the
EU via funding schemes that encourage researchers from both the natural and
the social sciences to implement the concept in their research projects.

Here, we review the implications RRI will have for the practice of SB; explore
the idea of RRI from several different angles, including engagement, gender
equality, science education, open access, ethics, and governance; and discuss
how this framework will be constructed.


18.4.1 Engagement of All Societal Actors (Researchers, Industry,
Policy Makers, and Civil Society) and Their Joint Participation
in Research and Innovation


Engagement of all societal actors is key to the RRI framework, as it helps to
bridge the gap between the scientific community and society at large. The
European Group on Ethics in Science and New Technologies (EGE) published
an opinion on SB [68]. In this report, the philosophical, anthropological,
ethical, legal, social, and scientific issues raised by SB were analyzed from
scientific, legal, governance and policy, and ethical perspectives. It pointed
out in particular the importance of public involvement and science-society
dialogue. The European Academies Science Advisory Council also investigated
the scientific and governance implications of SB [69]. It, too, pointed out
the importance of raising awareness of the opportunities and challenges of SB
both within the scientific community and among the public, as well as of
continuous public dialogue to ensure that “endeavours in synthetic biology reflect
wide public interests and aspirations.” Among the six recommendations
provided by the Working Group of Experts, one was on societal engagement,
emphasizing the proactive approaches the research community had already applied
to encourage and inform public debate on the basis of accurate information.
Dialogue among all societal actors is a prerequisite for building a framework of
RRI. Implementing RRI in SB would provide a unique stage for all societal actors
to carry out such dialogues.

A report from the Technology Strategy Board of the United Kingdom outlined
the importance of RRI in SB, particularly for its transition to industrial
applications. A responsible innovation framework would require ethical,
societal, and regulatory considerations both during the R&D process and during
commercial use. Throughout this process, all the stakeholders would have to be
involved [70]. A newly funded project (SYNENERGENE), supported by the EC under
the call for Science in Society, will provide some insight into how to build
such a framework to foster the growth of SB. This project will be conducted
jointly by 27 partners from Europe, the United States, and Canada. It will
bring together a wide range of scientists, regulators, NGOs, companies, and
other stakeholders to raise public awareness of SB, to get the stakeholders
involved, and to encourage public discourse and policy in an international
context.

As mentioned in the earlier sections, SB has the potential to bring
applications to society, and people from different backgrounds will have
different concerns about these applications. Thus, RRI aims to build “transparent,
interactive processes in which societal actors and innovators become mutually
responsive to each other with a view on the ethical acceptability, sustainability
and societal desirability of the innovation process and its marketable products”
[62], ideally bringing together societal actors with different interests and
values to reach a consistent strategy for developing the technologies and their
products.

RRI emphasizes the importance of engaging all stakeholders, including the
public [71]. Engaging all societal actors will help to build up public
acceptance of innovation, to fulfill the government’s responsibility to give
citizens the opportunity to express their opinions, and to make sure that public
and civil society stakeholders are also co-players in research and innovation.
Broad-spectrum public engagement would make research and innovation more
effective. A mature public perception of the technology will be important for
the future applications that SB will develop. The advances of SB might make the
knowledge and technology available to amateur scientists, making them possible
co-contributors. How to get the public involved is still a challenge, and policy
makers need to find solutions to make public involvement efficient and to assist
the public in forming an opinion. The development of proper models for engaging
societal actors in SB should draw on the experience gained with GE,
nanotechnology, and IT (the models of conflict, progress, and gadget). In an
opinion paper, Nerlich and McLeod argued that awareness of SB should be raised
responsibly, that is, through responsible communication, drawing a comparison
with the case of climate change, awareness of which should likewise be
advocated responsibly [72].


18.4.2 Gender Equality 


Gender equality is the second key issue in building the RRI framework. In
the latest report from the EC on structural changes in research institutes,
integrating a gender perspective was considered one of the key solutions
for improving research in the EU [73]. Promoting gender equality at all levels
contributes to research excellence and efficiency by making full use of a wider
pool of talent. The report set out a gender equality strategy (key steps) for
actors at the EU, national, and regional levels, as well as for gatekeepers of
scientific excellence and for universities and scientific institutes. For
example, the EC should attach gender requirements to all funding programs;
dedicated programs should be created to promote structural changes in
research institutes; the EU should set a good example worldwide regarding
gender issues; a special unit for gender issues should be reestablished; a
high-quality leadership development program should be created targeting
experts; and measurements of researcher mobility should incorporate a gender
dimension.

Gender issues have already been studied by the SB community. The Sybhel project
has studied how SB might influence the philosophical concepts of human health,
which also involved gender aspects, and analyzed gender issues related to SB
techniques in one of its work packages [74]. The ESRC Genomics Policy and
Research Forum at the University of Edinburgh ran a public engagement program
on SB. A Democs card game on SB was designed, and playing the game was used
as a resource to explore engagement with SB among lay publics in Scotland.
In the report on this program, the gender of the participants was analyzed in
the feedback on the game [75]. However, neither of these reports provided a
comprehensive understanding of the gender issues in SB. Thus, dedicated
projects are needed to address these issues.


18.4.3 Science Education


The third key to RRI is science education [76]. Creative learning of fresh ideas
will help enhance the current education process so that all societal actors can
obtain the relevant knowledge and tools to participate and make knowledge-based
judgments in the process of research and innovation. The educational activities
currently being explored are initiatives that will promote “a culture of
responsibility, participative inquiry, nuanced debate - starting in primary or high
schools and including governments, scientists, businesses and civil society” [71].
The same report also suggested the roles different stakeholders should play to
enhance the sciences. Both governments and research funders should foster
interdisciplinary cooperation and education. The consideration of ethical issues
and societal needs should be addressed through education and training. This
would better prepare societal actors to anticipate ethical concerns and to
take these concerns into consideration in future R&D [65].

SB is a nascent research field and well known for its interdisciplinary nature.
The novelty of SB calls for better science education, targeting not only the
public at large but also researchers from other disciplines. Meanwhile,
activities around SB, such as the International Genetically Engineered Machine
(iGEM) competition and the DIYBio movement, already provide platforms that
serve educational purposes. The iGEM competition is an annual worldwide SB
competition. It initially aimed at undergraduate university students to
promote their interest in this nascent field, but it has now expanded to include
divisions for high school students and other interested groups outside the
university setting [77]. DIYBio is a growing movement among amateur biologists.
They are individuals, or small groups, who conduct biological research outside
the conventional institutional setting (such as academic or industrial
facilities) with limited resources. Amateur biologists have little or no formal
training in biology [78-82]. Both iGEM and the DIYBio movement will open up new
education channels to the public. Both call for more support from the
professional community and regulatory bodies to ensure that the activities are
conducted in an efficient and beneficial way [83-88].

Among the many different ways of using science education to engage nonscientists
in science and technology issues, science games come in handy. For
example, BioFaction, as part of the European Science Foundation project on
synthetic lantibiotics called “SYNMOD,” developed a mobile app game to present
the concept and aim of the project in an entertaining and accessible way (see
Figure 18.3).


18.4.4 Open Access 


SB is a fast-growing field that can be assigned broadly to the knowledge-based
bioeconomy. Although SB is still at a nascent stage, open access issues
have already raised concerns for potential future applications. One study
analyzed the comparative benefits and pitfalls of open access and patenting
[89]. As mentioned earlier in the chapter in the discussion of frames and
comparators, SB is also influenced by the IT sector. So it comes as no surprise
that some ideas and practices in SB are influenced by the IT world. One specific
example is the uptake and translation of open access/open source software
practices to the world of biotechnology, a field where restrictive IPR have so
far been used [90]. Rather than restricting access to crucial information, some
synthetic biologists want to develop an open access approach to share the
information they obtain. That is the theory behind the BioBricks Foundation
(BBF), the International Open Facility Advancing Biotechnology (BioFab), the
Biological Innovation for an Open Society (BIOS), and the Synthetic Biology
Open Language (SBOL).

Figure 18.3 The SYNMOD game app allows players to create and combine different peptide
modules to design new antibiotics. The game is freely available for iOS and Android devices.
See http://www.biofaction.com/project/synmod-mobile-game/

BBF was founded by the scientists involved with the Registry of Standard
Biological Parts, aiming to provide a platform to “ensure that the engineering of
biology is conducted in an open and ethical manner to benefit all people and
the planet.” The Registry of Standard Biological Parts aims to allow interested
actors to contribute and access standard genetic components, so-called parts
and devices. Recently the BBF published the custom-made BioBrick™ Public
Agreement, which tries to set up a legal way to ensure open access [91].

BioFab was funded by the National Science Foundation (United States) to
support an open technology platform and to provide free genetic constructs that
can be customized for specific applications by academia and industry.

BIOS was created to “enhance the transparency, accessibility and capability 
to use all the tools of science, whether patented, open access or public domain.” 
It is believed that the “open access to research” concept will not only increase 
the transparency in research but also promote free exchange of information. 
According to its proponents, such transparency will promote development by 
sharing knowledge among the research community and will help to reduce the 
misuse of the technology [89]. 

SBOL is an open source movement for the in silico representation of genetic
designs. SBOL is designed to allow the electronic exchange of designs, to send
and retrieve genetic designs to and from research centers, to facilitate the
storage of genetic designs, and to embed genetic designs in publications [92].
More and more bioparts have now been registered in the database. A registry
software, the Joint BioEnergy Institute Inventory of Composable Elements
(JBEI-ICE), was created to provide a platform to manage the growing information
on bioparts. JBEI-ICE is built to support distributed, interconnected use and
to provide well-developed parts storage functionality for other SB software
projects.

The open access approach demonstrates not only the willingness of parts of the
scientific community to let information flow freely but also the demand from the
public to secure the common benefit from publicly funded research. Making open
access a reality is thus an important aspect of building the framework of RRI.
The challenges for open access are basically twofold: first, whether it will be
sustainable and successfully picked up by “users,” and second, the legal issues
of some open source content that (might) overlap with existing patents.


18.4.5 Ethics 


Another key component of the RRI framework is ethics. The values shared
within European society call for RRI to be built respecting fundamental
rights and the highest ethical standards [76]. As early as 2006, to stimulate the
development of SB in Europe, the EC funded 18 SB projects through NEST
Pathfinder, aiming to stimulate advancements in science as well as to address
ethical and safety concerns [93]. Among these projects, SYNBIOSAFE was
particularly dedicated to studying safety, ethical, and governance issues [94].
A number of other SB ethics-related projects funded by the EC followed
SYNBIOSAFE later on. The EGE published its opinion on SB [68], analyzing the
ethical issues raised by SB, including biosafety, biosecurity, justice, and
intellectual property issues. In this opinion, the EGE proposed twenty-six
recommendations regarding safety (environmental applications, sustainable
energy, and healthcare products), security, governance, intellectual property
(patent and justice), science and society aspects, and basic research.

A recent study from the EC pointed out that there were gaps between research and
innovation systems and RRI regarding ethics. The research system failed to
consider the ethical and societal aspects sufficiently, and the innovation
system often failed to anticipate future societal needs. In both systems, the
researchers were often insufficiently aware of the ethical and societal impacts
of their research activities [65]. To integrate the ethical dimension into
research projects, the EU has asked researchers to address the ethical questions
and questions of social needs (if any) associated with their projects in their
grant applications and research projects. To further integrate research
responsibility into research projects, the expert group proposed an improved
option as an alternative to the “business as usual” option: more research
funding should be allocated (€79 billion for Horizon 2020 and €2.5 billion for
COSME), and researchers should reflect both ethics and responsibility in their
proposals [65]. This option will require RRI to become part of the mainstream
of the EU funding programs. The share of trans-/interdisciplinary research
should be increased. Furthermore, a special fund dedicated to RRI research
should be set up. The importance of ethical considerations in research has also
been emphasized in a UK study [64]. The funder BBSRC has put in place a number
of checks and balances to ensure awareness of the ethical and social issues
raised by the funded projects. Examples of the checks on ethical issues are
ethical considerations on using animals in an experiment and the potential for
misuse/dual use of the knowledge obtained from the projects.


It is believed that guidance on responsible ethical assessment is needed in
order to remain vigilant about the harms of an emerging technology and prepared
to revise policy when necessary. This calls for a broad-based ethical framework
for SB. Several key ethical principles relevant to the social implications of SB
should be taken into consideration when evaluating SB and its potential risks
and benefits, such as public beneficence, responsible stewardship, intellectual
freedom and responsibility, democratic deliberation, and justice and fairness
[95]. Public dialogue on the ethical issues of SB is one of the key components
in applying this broad-based ethical framework to SB. The model developed on
the basis of the frames and comparators will be applied to these public dialogue
events to provide the participants with accurate yet understandable information
about the topics.


18.4.6 Governance 


Harmonious models for RRI integrating public engagement, gender equality,
science education, open access, and ethics can be built with proper governance.
The policy makers are the ones who should take action [76]. To clarify the role of
authority in the regulation of SB, the European Academies Science Advisory
Council investigated the scientific and governance implications of SB [69]. It
is still debated whether a specific policy for SB is needed to advance the field
or whether this would create additional obstacles to its growth. There are
already governance implications for biosafety and biosecurity, as SB “remains an
extension of recombinant DNA technology and the scientific community commits to
developing voluntary codes of conduct” [69]. The EC and member states should
support education and training programs in SB, while the societal and scientific
communities should be involved in the continuing debate to balance
self-governance and regulation. It was also suggested that the EC should build a
robust governance framework and raise the governance issues internationally,
particularly in the areas of research funding, ethics and human rights, and
biosecurity, as well as trade and IPR [68]. It is believed that the right
governance tools will support the responsible use of SB to promote scientific
advances that benefit the whole of society and the environment.

A report from a workshop organized by ERASynBio on public dialogue and
governance suggested that governance of SB should be based particularly on
three principles: participation, transparency, and accountability (see
http://www.erasynbio.eu). These principles should then be implemented at all
levels of the ERA-net, from strategies to individual projects, and should be
reflected in the calls and in the evaluation processes. The EC expert group
provided opinions on how to implement RRI with regard to governance (as
listed in Box 18.1):

To enable continuing RRI, the policy makers have called for collaboration with
all stakeholders. This includes calls from funders for collaborative projects
involving researchers from the natural and social sciences. The convergence of
both sciences aims to enhance both the scientific quality and the extent to
which social and ethical considerations are integrated [96]. The expert group of
the EC suggested that societal stakeholders should not only get involved in the
projects but also in the funding evaluation processes [65].


Box 18.1 Options to implement RRI [65]


• Applicants for EU research funds have to submit a statement on the ethical
aspects of their research. This could be emphasized and applied more broadly.
Additional guidance could be offered to applicants on the completion of this
section.

• Asking for a statement in each research proposal on how the research might
contribute to addressing societal challenges (similar to the outline on the
consideration of the Gender dimension).

• The potential contributions to societal needs and the consideration of ethical
aspects could become part of the selection criteria for research projects. So far,
proposals are assessed against (i) scientific excellence, (ii) potential impacts
(broadly defined), and (iii) management of the project. RRI aspects could be
considered as a fourth aspect or a specification of the potential impacts.


18.5 Conclusion 


SB is a nascent and innovative field of research with the potential to contribute
to the whole of society by addressing some of the challenges we are facing today,
ranging from sustainable energy to green economy to environmental remediation.
The industrial potential is believed to be huge, and many scientists,
politicians, and industrialists see SB as the key to the knowledge-based
bioeconomy. Right now the public knows little about SB, and public awareness of
the field is growing slowly, as indicated by the studies on public attitudes
toward SB in Europe and in the United States. With the European conflict on the
use of GM crops still present in many people’s minds, some fear that SB could
run into similar problems if it is seen as a mere GE 2.0, thus halting the
development process. Other observers underline the interdisciplinary character
of SB, pointing out that it might be a real converging technology where
nanotechnology, IT, and biotechnology all come together. What is true for the
technological convergence could also be true for the public debate about the
technology, as we have elucidated through the frames used in the public debate
and the comparators of SB. Since SB is as much a form of GE as it is a form of
nanotechnology and a form of IT, the respective public images will influence the
developing public perception of SB. While GE contributes the “technology as
conflict” comparator, nanotechnology is represented in the notion of “technology
as progress,” and IT is “technology as gadget.” As a consequence, SB might not
be a GE 2.0 but something different, something new and unique.

Findings from the studies on the public perception of SB as well as the analysis
of the frames and comparators of SB call for an innovative approach to address
these issues. SB, at least in Europe, should be developed along with an attempt to
promote a contemporary approach to technological development: RRI. This
approach takes up issues of stakeholder participation, science education, gender
equality, open access, ethics, and governance and can be seen as a
comprehensive approach to dealing with novel technologies in an environment
that does not automatically praise every scientific development per se and a
priori as great, but asks how the technology could not only support the economy
but actually benefit people and the environment.

Learning from the lessons of the past, applying the RRI approach in SB is critical
to facilitate the social benefits of an emerging technology, to avoid raising social
resistance to a technological advance that does not benefit people, and to
support trust in research and innovation. More activities to get the public
involved should be encouraged, while novel models should be built for open
dialogues on SB. One of our ongoing projects, SYNENERGENE, may provide such
models by setting up six platforms to tackle the issues and challenges of SB:
the future of the field; public science and participation; art, culture, and
society; research and policy; the international dimension; and online
communication. We also expect to see more activities on implementing RRI in SB
in the coming years.